New information extracted from Snowden PDFs through metadata version analysis

(libroot.org)

293 points | by libroot 16 hours ago

13 comments

layer8 11 hours ago
These PDFs apparently used the “incremental update” feature of PDF, where edits to the document are merely appended to the original file.
It’s easy to extract the earlier versions, for example with a plain text editor. Just search for lines starting with “%%EOF”, and truncate the file after that line. Voila, the resulting file is the respective earlier PDF version.
(One exception is the first %%EOF in a so-called linearized PDF, which marks a pseudo-revision that is only there for technical reasons and isn’t a valid PDF file by itself.)
[-]
- theturtletalks 7 hours ago
  New OSINT skill unlocked
  [-]
  - toomuchtodo 6 hours ago
    I see an interesting parallel to how people think about captured encrypted data, and how long that encryption needs to be effective for until technology catches up and can decrypt (by which point, hopefully the decrypted data is worthless). If all of these documents are stored in durable archives, future methodologies may arrive to extract value or intelligence not originally available at the time of capture and disclosure.
    [-]
    - theturtletalks 6 hours ago
      > If all of these documents are stored in durable archives, future methodologies may arrive to extract value or intelligence not originally available at the time of capture and disclosure.
      I recently learned that some people improve or brush up on their OSINT skills by trying to find missing people!
- ajross 11 hours ago
  It's hilarious the extent to which Adobe Systems's ridiculously futile attempt to chase MS Word features ended up being the single most productive espionage tool of the last quarter century.
  [-]
  - layer8 11 hours ago
    I don’t think this was particularly modeled on MS Word. The incremental update feature was introduced with PDF 1.2 in 1996. It allows to quickly save changes without having to rewrite the whole file, for example when annotating a PDF.
    Incremental updates are also essential for PDF signatures, since when you add a subsequent signature to a PDF, you couldn’t rewrite the file without breaking previous signatures. Hence signatures are appended as incremental updates.
    [-]
    - ajross 9 hours ago
      PDF files are for storing fixed (!!) output of printed/printable material. That's where the format's roots are via Postscript, it's where the format found its main success in document storage, and it's the metaphor everyone has in mind when using the format.
      PDFs don't change. PDFs are what they look like.
      Except they aren't, because Adobe wanted to be able to (ahem) "annotate" them, or "save changes" to them. And Adobe wanted this because they wanted to sell Acrobat to people who would otherwise be using MS Word for these purposes.
      And in so doing, Adobe broke the fundamental design paradigm of the format. And that has had (and continues to have, to hilarious effect) continuing security impact for the data that gets stored in this terrible format.
      [-]
      - detourdog 6 hours ago
        When Acrobat came out cross platform was not common. Being able to publish a document that could be opened on multiple platforms was a big advantage. I was using it to distribute technical specifications in the mid 90's. Different pages of these specifications came from, Filemaker, Excel, Word, Mini-Cad, Photoshop, Illustrator, and probably other applications as well. We would combine these into a single PDF file. This simplified version control. This also meant that bidders could not edit the specifications.
        None of that could be accomplished with Word alone. I think you are underestimating the qualities of PDF for distribution of complex documents.
        [-]
        ajross 2 hours ago
        > This also meant that bidders could not edit the specifications.
        But they can! That's the bug, PDF is a mutable file format owing to Adobe's muckery. And you made the same mistake that every government redactor and censor (up to and including the ?!@$! NSA per the linked article) has in the intervening decades.
        The file format you thought you were using was a great fit for your problem, and better than MS Word. The software Adobe shipped was, in fact, something else.
      - ogurechny 37 minutes ago
        It started in the '80s. PostScript was the big deal. It was a printer language, not a document language. It was not limited to “(mostly) text documents”, even though complex vector fonts and even hinting were introduced. For example, you could print some high quality vector graphs in native printer resolution from systems which would never ever get enough memory to rasterise such giant bitmaps, by writing/exporting to PostScript. That's where Adobe's business was. See also NeWS and NeXT.
        However, arbitrary non-trivial PostScript files were of little use to people without a hardware or software rasteriser (and sometimes fonts matching the ones the author had, and sometimes the specific brand of RIP matching the quirks of authoring software, etc.), so it was generally used by people in publishing or near it. PDF was an attempt to make a document distribution format which was more suitable to more common people and more common hardware (remember the non-workstation screen resolutions at the time). I doubt that anyone imagined typical home users writing letters and bulletins in Acrobat, of all things (though it does happen). It would be similar to buying Photoshop to resize images (and waiting for it to load each time). Therefore, competitor to Word it was not. Vice versa, Word file was never considered a format suitable for printing. The more complex the layout and embedded objects, the less likely it would render properly on publisher's system (if Microsoft Office did exist for its architecture at all). Moreover, it lacked some features which were essential for even small scale book publishing.
        Append-only or versioned-indexed chunk-based file formats for things we consider trivial plain data today were common at the time. Files could be too big to rewrite completely each time even without edits, just because of disk throughput and size limits. The system could not be able to load all of the data into memory because of addressing or size limitations (especially when we talk about illustrations in resolutions suitable for printing). Just like modern games only load the objects in player's vicinity instead of copying all of the dozens or hundreds of gigabytes into memory, document viewers had to load objects only in the area visible on screen. Change the page or zoom level, and wait until everything reloads from disk once again. Web browsers, for example, handle web pages of any length in the same fashion. I should also remind you that default editing mode in Word itself in the '90s was not set to WYSIWYG for similar performance reasons. If you look at the PDF object tree, you can see that some properties are set on the level above the data object, and that allows overwriting the small part of the index with the next version to change, say, position without ever touching the chunk in which the big data itself stays (because appending the new version of that chunk, while possible, would increase the file size much more).
        Document redraw speed can be seen in this random video. But that's 1999, and they probably got a really well performing system to record the promotional content. https://www.youtube.com/watch?v=Pv6fZnQ_ExU
        PDF is a terrible format not because of that, but because its “standard” retroactively defined everything from the point of view of Acrobat developer, and skipped all the corner cases and ramifications (because if you are an Acrobat developer, you define what is a corner case, and what is not). As a consequence, unless you are in a closed environment you control, the only practical validator for arbitrary PDFs is Acrobat (I don't think that happened by chance). The external client is always going to say “But it looks just fine on my screen”.
    - cubefox 11 hours ago
      I'm pretty sure you can change various file formats without rewriting the entire file and without using "incremental updates".
      [-]
      - layer8 10 hours ago
        You can’t insert data into the middle of a file (or remove portions from the middle of a file) without either rewriting it completely, or at least rewriting everything after the insertion point; the latter requires holding everything after the insertion point in memory (or writing it out to another file first, then reading it in and writing it out again).
        PDF is designed to not require holding the complete file in memory. (PDF viewers can display PDFs larger than available memory, as long as the currently displayed page and associated metadata fits in memory. Similar for editing.)
        [-]
        maweki 7 hours ago
        While tedious, you can do the rewrite block-wise from the insertion point and only store a an additional block's worth of the rest (or twice as much as you inserted)
        ABCDE, to insert 1 after C: store D, overwrite D with 1, store E, overwrite E with D, write E.
      - mattzito 10 hours ago
        No, if you are going to change the structure of a structured document that has been saved to disk, your options are:
        1) Rewrite the file to disk 2) Append the new data/metadata to the end of the existing file
        I suppose you could pre-pad documents with empty blocks and then go modify those in situ by binary editing the file, but that sounds like a nightmare.
        [-]
        cubefox 10 hours ago
        Aren't there file systems that support data structures which allow editing just part of the data, like linked lists?
        [-]
        PhilipRoman 8 hours ago
        Yeah there are, Linux supports parameters FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE for fallocate(2). Like most fancy filesystem features, they are not used by the vast majority of software because it has to run on any filesystem so you'd always need to maintain two implementations (and extensive test cases).
        [-]
        altfredd 3 hours ago
        They are fully supported almost everywhere. XFS, ext4, tmpfs, f2fs and a bunch of misc filesystems all support them.
        Ext4 support dates as early as Linux 3.15, released in 2014. It is ancient at this point!
        cubefox 8 hours ago
        Interesting that after decades of file system history, this is still considered a "fancy feature", considering that editing files is a pretty basic operation for a file system. Though I assume there are reasons why this hasn't become standard long ago.
        [-]
        layer8 7 hours ago
        File systems aren’t databases; they manage flat files, not structured data. You also can’t just insert/remove random amounts of bytes in RAM. The considerations here are actually quite similar, like fragmentation. If you make a hundred small edits to a file, you might end up with the file taking up ten times as much space due to fragmentation, and then you’d need the file system to do some sort of defragmentation pass to rewrite the file more contiguously again.
        In addition, it’s generally nontrivial for a program to map changes to an in-memory object structure back to surgical edits of a flat file. It’s much easier to always just serialize the whole thing, or if the file format allows it, appending the serialized changes to the file.
        [-]
        PhilipRoman 7 hours ago
        Indeed, also userspace-level atomicity is important, so you probably want to save a backup in case power goes out at an unfortunate moment. And since you already need to have a backup, might as well go for a full rewrite + rename combo.
        formerly_proven 8 hours ago
        What this does on typical extent-based file systems is split the extent of the file at the given location (which means these operations can only be done with cluster granularity) and then insert a third extent. i.e. calling INSERT_RANGE once will give you a file with at least three extents (fragments). This, plus the mkfs-options-dependent alignment requirements, makes it really quite uninteresting for broad use in a similar fashion as O_DIRECT is uninteresting.
        layer8 10 hours ago
        Look at the C file API which most software is based on, it simply doesn’t allow it. Writing at a given file position just overwrites existing content. There is no way to insert or remove bytes in the middle.
        Apart from that, file systems manage storage in larger fixed-size blocks (commonly 4 KB). One block typically links to the next block (if any) of the same file, but that’s about the extent of it.
        [-]
        anthk 3 hours ago
        DD should.
        formerly_proven 9 hours ago
        No. Well yes. On mainframes.
        This is why “table of contents at the end” is such an exceedingly common design choice.
      - bahmboo 6 hours ago
        This was 1996. A typical computer had tens of megabytes of memory with throughput a fraction of what we have today. Appending an element instead of reading, parsing, inserting and validating the entire document is a better solution in so many ways. That people doing redactions don't understand the technology is a separate problem. The context matters.
password4321 12 hours ago
The "print and scan physical papers back to a PDF of images" technique for final release is looking better and better from an information protection perspective.
[-]
- cookiengineer 12 hours ago
  > The "print and scan physical papers back to a PDF of images" technique for final release is looking better and better from an information protection perspective.
  Note that all (edit: color-/ink-) printers have "invisible to the human eye" yellow dotcodes, which contain their serial number, and in some cases even the public IP address when they've already connected to the internet (looking at you, HP and Canon).
  So I'd be careful to use a printer of any kind if you're not in control of the printer's firmware.
  There's lots of tools that started to decode the information hidden in dotcodes, in case you're interested [1] [2] [3]
  [1] https://github.com/Natounet/YellowDotDecode
  [2] https://github.com/mcandre/dotsecrets
  [3] (when I first found out about it in 2007) https://fahrplan.events.ccc.de/camp/2007/Fahrplan/events/197...
  [-]
  - culi 9 hours ago
    That's why I'm (still) waiting on this https://www.crowdsupply.com/open-tools/open-printer
    It's mindboggling how much open-source 3d printing stuff is out there (and I'm grateful for it) but this is completely lacking in the 2d printing world
    [-]
    - octoberfranklin 7 hours ago
      Sorry, you have been blocked You are unable to access crowdsupply.com Why have I been blocked? This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.
      What can I do to resolve this? You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.
      Cloudflare Ray ID: 9bbed59d7bcd9dfc • Performance & security by Cloudflare
  - emptybits 10 hours ago
    Thanks for the links but can you share evidence for the "public IP address" claim? Each time I've read this concept (intriguing! possible!), I search for evidence and I can't find any.
    The MIC and yellow dots have been studied and decoded by many and all I've ever seen, including at your links, are essentially date + time + serial#.
    Don't get me wrong ... stamping our documents with a fingerprint back to our printers and adding date and time is nasty enough. I don't see a need to overstate the scope of what is shared though.
    [-]
  - everdrive 11 hours ago
    >Note that all printers have "invisible to the human eye" yellow dotcodes, which contain their serial number, and in some cases even the public IP address when they've already connected to the internet (looking at you, HP and Canon).
    I've got a black and white brother printer which uses toner. Is there something similar for this printer?
    [-]
    - nwallin 10 hours ago
      It's only there for color printers.
      A tiny yellow dot on white paper is basically invisible to the human eye. Yellow ink absorbs blue light and no other light, and human vision is crap at resolving blue details.
      A tiny black dot on white paper sticks out like a sore thumb.
    - gramie 11 hours ago
      I believe that this only exists for colour printers. The official reasoning was to trace people counterfeiting money.
    - cookiengineer 11 hours ago
      > black and white brother printer
      excellent choice, that's what I am using. Also it's Linux / CUPS compatible and without a broken proprietary rasterizer.
      [-]
      - contingencies 11 hours ago
        Same thing here. A few years ago I bought three brands of printer-scanner combos for our R&D office, returned the others. Brother was the least broken despite still not being perfect. Issues include broken scanning drivers and fake toner warnings at ~1/3 level.
    - IshKebab 9 hours ago
      Yes, the data can be embedded by modulating the laser.
      But I've only seen research showing that it's possible. As far as I know nobody has demonstrated whether actual laser printers use that technique or not.
  - mmh0000 10 hours ago
    If you have a UV flashlight, these dots are visible with decent vision.
    And of course we have to include the Wikipedia entry:
    https://en.wikipedia.org/wiki/Printer_tracking_dots
    [-]
    - culi 9 hours ago
      And also EFF's attempt to track all printers that do or do not display tracking dots which they eventually prepended with
      > (Added 2015) Some of the documents that we previously received through FOIA suggested that all major manufacturers of color laser printers entered a secret agreement with governments to ensure that the output of those printers is forensically traceable.
      > This list is no longer being updated.
      https://www.eff.org/pages/list-printers-which-do-or-do-not-d...
  - askvictor 6 hours ago
    Could this be circumvented by randomly (or not-so-randomly) adding single-pixel yellow dots to the data sent to the printer?
- notepad0x90 8 hours ago
  a better approach is to convert them to jpeg/png. Then convert that to raw BMP, and then share or print that.
  A more modern approach for text documents would be to have an LLM read and rephrase, and restructure everything without preserving punctuation and spacing, using a simple encoding like utf-8, and then use the technique above or just take analog pictures of the monitor. The analog (film) part protects against deepfakes and serves as proof if you need it (for the source and final product alike).
  There various solutions out there after the leaks that keep happening where documents and confidential information is served/staged in a way that will reveal the person with who it is shared. Even if you copy paste the text into notepad and save it in ascii format, it will reveal you. Off-the-shelf printers are of course a big no-no.
  If all else fails, that analog picture technique works best for exfil, but the final thing you share will still track back to you. I bet spies are back to using microfilms these days.
  I only say all of that purely out of a fascination into the subject and for the sake of discussion (think like a thief if you want to catch one and all). Ultimately, you shouldn't share private information with unauthorized parties, period. Personal or otherwise. If you, like snowden, feel that all lawful means are exhausted and that is your only option to address some grievance, then don't assume any technique or planning will protect you, if it isn't worth the risk of imprisonment, then you shouldn't be doing it anyways. Assume you will be imprisoned or worse.
- emeril 11 hours ago
  I suppose I'd just save the pdf to tiff/png then remake back into a pdf from there to avoid printing and scanning?
  if really paranoid, I suppose one could run a filter on the image files to make them a bit fuzzy/noisy
  [-]
  - maximilianthe1 9 hours ago
    I think "Print to PDF" would be easiest
    [-]
    - bawolff 7 hours ago
      I'd worry print to pdf might be ineffective. I think rasterizing is the way to go.
- tester756 11 hours ago
  Why not just make screenshoot of every PDF page?
  [-]
  - jeffbee 11 hours ago
    It could still be identifiable, for example if the document has been prepared such that the intended recipient's identity is encoded into subtle modulation of the widths of spaces.
    [-]
    - yyyk 10 hours ago
      That's outside this threat model? The idea here is trying to foil outside analysis, not limit the document authors (which are allowed to add/update and even write openly 'the intended recipient's identity').
    - sincerely 11 hours ago
      Print and re-scan wouldn’t fix that though.
      [-]
      - jeffbee 11 hours ago
        That was my point. If you want to erase its origin you need to semantically extract the contents and reduce them to their most basic representation.
    - tester756 11 hours ago
      Sure, but all those not-essential information hidden in PDFs format are removed
    - idiotsecant 7 hours ago
      In PDF file format?
- iAMkenough 9 hours ago
  That'd be fun to make Section 508 compliant at mass scale.
- JumpinJack_Cash 11 hours ago
  Is there a multifunction B&W printer which prints and then automatically positions the paper on the scanner and scans?
  [-]
  - dredmorbius 10 hours ago
    Far more straightforward to print a stack, then feed that stack through the copier/scanner.
    [-]
    - IshKebab 9 hours ago
      You don't need to actually print and scan. Just convert to a raster format like PNG.
      [-]
      - dredmorbius 7 hours ago
        Given issues with fully-electronic conversion, passing through a paper phase tends to guard against foul-ups. It's tangible and demonstrable. People ... take short-cuts, which is why we're having this discussion.
      - iAMkenough 9 hours ago
        Then use OCR to convert it back from raster for Section 508 compliance. All the existing work to make handwritten pages and visuals compliant would have to be redone after converting to raster.
alhirzel 14 hours ago
There needs to be better tooling for inspecting PDF documents. Right now, my needs are met by using `qpdf` to export QDF [1], but it is just begging for a GUI to wrap around it...
[1] https://qpdf.readthedocs.io/en/stable/qdf.html
[-]
- piffey 8 hours ago
  Take a look at the REMNux reverse engineering page for PDF documents (https://docs.remnux.org/discover-the-tools/analyze+documents...). Lots of tools here for looking at malicious PDFs that can be used to inspect/understand even non-malicious documents.
- soared 11 hours ago
  In what contest do you use that tool? Looks like that page is primarily about editing pdfs using that format rather than inspecting.
  Very tempting to fool around with the ideas especially after the Epstein pdf debacle.
  [-]
Ms-J 13 hours ago
This is insightful work, great job.
Recently someone else revisited the Snowden documents and also found more info, but I can't recall the exact details.
Snowden and the archives were absolute gifts to us all. It's a shame he didn't release everything in full though.
[-]
- libroot 12 hours ago
  Thank you. The most recent completely new information from the Snowden files is found in Jacob Appelbaum's 2022 thesis[1], in which he revealed information that had not been previously public (not found on any previously published documents and so on). And AFAIK, the most recent new information from the published documents (along with this post) might actually be in our other posts[2], but there might be some others we aren't aware of.
  [1]: https://www.electrospaces.net/2023/09/some-new-snippets-from...
  [2]: Part 2: https://libroot.org/posts/going-through-snowden-documents-pa...
  and part 3: https://libroot.org/posts/going-through-snowden-documents-pa...
- dfreel 13 hours ago
  [flagged]
  [-]
  - SAI_Peregrinus 12 hours ago
    Snowden never had Russia as a destination, the US revoked his passport while he was waiting in a layover. He was stuck in the airport for months. How is it "telling" of anything?
  - Andrex 12 hours ago
    The best way to fix a problem is to bring it into the light, not pretend it doesn't exist. "Security by obscurity" has been debunked for decades.
    If our system is so flawed Snowden's leaks would have blown everything up, maybe the system deserves to be blown up.
    Otherwise we're just papering over flaws which likely will be discovered and exploited eventually.
    [-]
    - dfreel 12 hours ago
      [flagged]
      [-]
      - threethirtytwo 12 hours ago
        That’s a one sided view. Secrecy is also used to hide corruption and crimes. There is plenty of corruption going on just within the CIA.
      - Andrex 11 hours ago
        "Snowden's aim was to damage the US and its allies, and he succeeded in this."
        I doubt it but if you have a source I'll check it out. Third party speculation doesn't count, obviously.
      - sroussey 11 hours ago
        I call BS. The Snowden files completely rewrote the rules of security inside Google and everywhere and led to zero-trust. These companies are now protected against this unlawful hacking of the government on their local companies and thus also better protected from governments around the world. Ironically, the leaks made the US more secure.
      - wizzwizz4 12 hours ago
        Are you thinking of Julian Assange? I'm not aware of Edward Snowden releasing anything that might have resulted in the deaths of field agents.
      - plagiarist 11 hours ago
        There is a non-zero number of documents that are classified as Top Secret not for national security but because corrupt shitheads are in control of classifying documents.
      - maqp 11 hours ago
        >Snowden's aim was to damage the US and its allies, and he succeeded in this.
        Dude, nobody's buying this nonsense. Snowden expressed his concerns multiple times. He talked about the surveillance enabling turn-key tyranny, if ever a fascist leader would rise into power in the US. And look what's happening now. He was right, and thank god he blew the whistle, as that gave privacy activists a decade long headstart to get end-to-end encryption deployed.
      - CamperBob2 10 hours ago
        Given that our incompetent security policies apparently granted full access to people with no conceivable need to know (see also Manning), the bad guys already had all that stuff. If that wasn't the case before, it certainly is now, with Trump in office.
        Law-abiding US citizens are pretty much the only ones who didn't know what was being done in their names. That's the only thing the Snowden disclosures changed.
  - jibal 12 hours ago
    Your comment is indeed very telling. He ended up in Russia because the U.S. revoked his visa while en route to Ecuador so he was forced to live in a Russian airport for 6 weeks.
    [-]
    - dfreel 12 hours ago
      [flagged]
      [-]
      - paulryanrogers 11 hours ago
        > One can join the dots and have a good idea of exactly what was the purpose of doing this.
        The purpose was to avoid rendition to and torture within black sites of ambiguous jurisdiction.
        If classified material is so precious then why not lock up the guy who showed off and stored stolen classified material in his golf course bathroom?
      - fc417fc802 11 hours ago
        The obvious purpose would be to avoid the authorities of the US and our allies, no? If you broke US federal law and hoped to avoid being detained, would you choose to travel to a NATO country or a non-NATO one? This isn't rocket science.
      - eesmith 11 hours ago
        He deliberately planned to travel via countries which were unlikely to extradite him to the US on his way to a country which offered him permanent asylum.
        Do you have a suggestion for a better routing? I surely can't. How should he have gotten to Ecuador? (Which, btw, is not a US adversary.)
        As for your "drop off" conjecture, we have no evidence that happened, and unless you are attached to the "Snowden was black-hearted liar" fabrication, we can all read https://en.wikipedia.org/wiki/Edward_Snowden where he says he did not do that, and explains why:
        > In October 2013, Snowden said that before flying to Moscow, he gave all the classified documents he had obtained to journalists he met in Hong Kong and kept no copies for himself.[110] In January 2014, he told a German TV interviewer that he gave all of his information to American journalists reporting on American issues.[57] During his first American TV interview, in May 2014, Snowden said he had protected himself from Russian leverage by destroying the material he had been holding before landing in Moscow.
        I take it you believe he lied, and during the last decade-plus his nefarious actions and additional secret files never leaked.
        Would you care to explain the basis for your belief?
  - maqp 12 hours ago
    >It is of course very telling that Snowden ended up in Russia.
    Yeah it's almost like you can revoke someone's passport during their layover in Russia and make the people with MAGA-levels of intelligence take the optics at face value through decade long repeated messaging.
    If Snowden was a Russian spy, he would've taken the files, given them to Putin, received the largest Datša in the country and we would never have heard from him or the files. Instead, he gave it to journalists who made the call what to release.
    If you don't want people to blow the whistle, stop breaking the damn law https://www.theguardian.com/us-news/2020/sep/03/edward-snowd...
    [-]
    - dfreel 12 hours ago
      Very naive to think that the Russian and Chinese governments didn't get a full copy of the documents Snowden stole and absconded with.
      How have so many people been taken in by this tall tale that Snowden is some sort of hero? Gullible doesn't even begin to cover it.
      [-]
      - Zigurd 12 hours ago
        Even if you are playing tic-tac-toe at a chess tournament, you still have to think a move ahead. Saying "Very naive to think that the Russian and Chinese governments didn't get a full copy" makes your initial point moot. If the adversaries you are supposedly worried about already have everything, what's the point of keeping it from the American people?
      - jibal 12 hours ago
        You're changing the subject from how you misrepresented why Snowden ended up in Russia.
        I won't engage further with someone acting in bad faith.
        [-]
        dfreel 12 hours ago
        You replied to the wrong comment.
        Anyway, the weakening of the national security apparatus likely contributed to why we now have Trump as head of state, and all the harm he's doing to national interests. So you can thank Snowden for that too.
        [-]
        plagiarist 11 hours ago
        How? Did Snowden interrupt something the NSA was planning to about Facebook and Fox News running far-right propaganda?
        [-]
        lern_too_spel 1 minute ago
        Snowden's and Greenwald's illiterate misinterpretation of PRISM certainly hurt the Democrats.
      - nebezb 6 hours ago
        > Very naive to think that the Russian and Chinese governments didn't get a full copy of the documents Snowden stole and absconded with.
        Does one need to be gullible to believe this? Or will you substantiate your extraordinary claim?
      - anticodon 12 hours ago
        What Russia and China has in common? Why would somebody work for both countries?
        Do you know, for example, that China willingly sells huge amounts of drones to Ukraine?
        [-]
        jfengel 11 hours ago
        Here in 2026 China is supplying Russia with weapons in exchange for oil under global sanctions.
        Russia is in no position to reject China for selling to both sides. They may not be allies but each is the enemy of their enemies.
        [-]
        anticodon 10 hours ago
        Which exactly weapons are supplied by China? Even ever lying news sources like Bloomberg and CNN never made such unfounded accusations.
        Also, how and why some "spy" would work both for China and Russia? Two very different countries from every point of view: culturally, economically, and in every other way also.
        The only thing in common is that USA wants to destroy both Russia and China and that because of that reason US controlled media (like 90% of media in the world) publish scary fakes about both countries.
      - maqp 12 hours ago
        >Very naive to think that the Russian and Chinese governments didn't get a full copy of the documents Snowden stole and absconded with.
        Extraordinary claims require extraordinary evidence.
        >Snowden is some sort of hero?
        He left his cozy upper middle-class life, partner, and put his life on the line to expose illegal mass surveillance. That's gazillion times more risk and sacrifice to do the right thing, than you'll ever accomplish.
        [-]
        jfengel 11 hours ago
        The problem is that much of it wasn't illegal. Some was, but some was just spy agencies doing spy agency things. The laws draw some pretty fine distinctions that are at odds with what you expect.
        Perhaps it's worth it to have exposed the genuinely illegal things and have scrutiny on the legal but unpleasant ones. But I don't think that's obviously the case. Spy agencies are by definition going to do stuff you wouldn't approve of if you weren't paying close attention to what protections are in place.
  - zoklet-enjoyer 12 hours ago
    Spy on me harder, daddy
  - verelo 12 hours ago
    Wow the Reddit bots made it to HN. We must be famous now.
c-c-c-c-c 11 hours ago
> We contacted Ryan Gallagher, the journalist who led both investigations, to ask about the editorial decision to remove these sections. After more than a week, we have not received a response.
Hopefully we'll hear something now that the Christmas holidays are over.
[-]
- echelon 10 hours ago
  Why are the journalists redacting the docs? That's incredibly puzzling.
  Is there something in here so damaging that they refuse to publish it?
  Did the government tell them they'd be in trouble if they published it?
  Are the journalists the only ones with access to the raw files?
  [-]
  - nacozarina 9 hours ago
    Traditionally an editor would be obligated to review the material and redact info that could be harmful to others. The publisher has distinct liability independent of govt opinion.
    [-]
    - mlmonkey 2 hours ago
      > and redact info that could be harmful to others.
      of course, these concerns are only applicable when these "others" are Americans and the American institutions.
      Everybody else can just fend for themselves.
      Whats good for the goose, should be good for the gander. If American journalists feel like there is no problem with disclosing secrets of, say, Maduro, then they should not be protecting people like Trump (just as an example).
pfisherman 14 hours ago
Can someone spell out how this is possible? Do pdfs store a complete document version history? Do they store diffs in the metadata? Does this happen each time the document is edited?
[-]
- aidos 14 hours ago
  You can replace objects in PDF documents. A PDF is mostly just a bunch of objects of different types so the readers know what to do with them. Each object has a numbered ID. I recommend mutool for decompressing the PDF so you can read it in a text editor:
```
    mutool clean -d in.pdf out.pdf
```
  If you look below you can see a Pages list (1 0 obj) that references (2 0 R) a Page (2 0 obj).
```
    1 0 obj
    <<
      /Type /Pages
      /Count 1
      /Kids [ 2 0 R ]
    >>
    endobj

    2 0 obj
    <<
      /Type /Page
      /Contents 5 0 R
      ...
    >>
    endobj
```
  Rather than editing the PDFs in place, it's possible to update these objects to overwrite them by appending a new "generation" of an object. Notice the 0 has been incremented to a 1 here. This allows leaving the original PDF intact while making edits.
```
    1 1 obj
    <<
      /Type /Pages
      /Count 2
      /Kids [ 2 0 R 200 0 R ]
    >>
    endobj
```
  You can have anything inside a PDF that you want really and it could be orphaned so a PDF reader never picks up on it. There's nothing to say an object needs to be referenced (oh, there's a "trailer" at the end of the PDF that says where the Root node is, so they know where to start).
  [-]
  - pfisherman 12 hours ago
    Thanks for the technical explanation! This is pretty fascinating.
    So it works kind of like a soft delete — dereference instead of scrubbing the bits.
    Is this behavior generally explicitly defined in PDF editors (i.e. an intended feature)? Is it defined in some standard or set of best practices? Or is it a hack (or half baked feature) someone implemented years ago that has just kind of stuck around and propagated?
    [-]
    - clord 11 hours ago
      The intention is to make editing easy and quick on slow and memory deficient computers. This is how for example editing a pdf with form field values can be so fast. It’s just appending new values for those nodes. If you need to omit edits you’d have to regenerate a fresh pdf from the root.
  - SeriousM 13 hours ago
    To put it reaaaaaly simple, a PDF is like a notion document (blocks and bricks) with a git-like object graph?
    [-]
    - aidos 12 hours ago
      Ha! As if anything about Notion is simple.
      But yeah. It's all just objects pointing at each other. It's mostly tree structured, but not entirely. You have a Catalog of Pages that have Resources, like Fonts (that are likely to be shared by multiple pages hence, not a tree). Each Page has Contents that are a stream of drawing instructions.
      This gives you a sense of what it all looks like. The contents of a page is a stack based vector drawing system. Squint a little (or stick it through an LLM) and you'll see Tf switches to Font F4 from the resources at size 14.66, Tj is placing a char at a position etc.
      2 0 obj << /Type /Page /Resources << /Font << /F4 4 0 R >> >> /Contents 5 0 R >> endobj 5 0 obj << /Length 340 >> stream q BT /F4 14.66 Tf 1 0 0 -1 0 .47981739 Tm 0 -13.2773438 Td <002B> Tj 10.5842743 0 Td <004C> Tj ET Q... endstream endobj
      I'm going to hand wave away the 100+ different types of objects. But at it's core it's a simple model.
- flotzam 14 hours ago
  At the bottom of the page there's a link to the pdfresurrect package, whose description says
  "The PDF format allows for previous changes to be retained in a revised version of the document, thereby keeping a running history of revisions to the document.
  This tool extracts all previous revisions while also producing a summary of changes between revisions."
  [-]
  - QuantumNomad_ 8 hours ago
    Neat!
    https://github.com/enferex/pdfresurrect
- 1317 11 hours ago
  https://hackerfactor.com/blog/index.php?/archives/1085-A-Typ...
- alhirzel 14 hours ago
  PDFs are just a table of objects and tree of references to those objects; probably, prior versions of the document were expressed in objects with no references or something like that.
pseudosavant 8 hours ago
In addition to the print paper and scan approach, I do wonder how effective it would be to “Print to XPS” and then “print” that into a PDF.
bawolff 7 hours ago
Its crazy this is just being discovered now.
[-]
- WiSaGaN 2 hours ago
  I think it's likely someone already discovered this. It's just that info is not broadcasted to people who want to comment on this thread.
- chatmasta 7 hours ago
  I wonder if it’s because of all the attention on the Epstein PDF files.
treetalker 14 hours ago
% pdfresurrect -w epsteinfiles.pdf
[-]
- cypherpunks01 12 hours ago
  Anyone tried this?
  [-]
  - snek_case 10 hours ago
    Weekend project?
  - huflungdung 11 hours ago
    [dead]
londons_explore 12 hours ago
So this is almost certainly redaction by the journalists?
It is disappointing they didn't mark those sections "redacted", with an explanation of why.
It is also disappointing they didn't have enough technical knowhow to at least take a screenshot and publish that rather than the original PDF which presumably still contains all kinds of info in the metadata.
[-]
- libroot 12 hours ago
  Yes, the journalists did the redactions. The metadata timestamps in one of the documents show that the versions were created three weeks before the publication.
  And to be honest, the journalists generally have done a great work on pretty much in all the other published PDFs. We've went through hundreds and hundreds of the published documents, and these two documents were pretty much the only ones which had metadata leak by a mistake revealing something significant (there are other documents as well with metadata leaks/failed redactions, but nothing huge). Our next part will be a technical deep-dive on PDF forensic/metadata analysis we've done.
  [-]
  - DANmode 9 hours ago
    Great work, great comment.
    Thank you.
farceSpherule 4 hours ago
[dead]
jokoon 13 hours ago
I have read claims that there were fake documents inserted in those leaks, who aimed at pushing disinformation.
[-]
- nkrisc 12 hours ago
  That itself would be a very convenient lie if the disclosures were damaging or embarrassing.
- firefax 13 hours ago
  Maybe you should include a source, especially if you're making claims about alleged "disinformation"? :-)
tolerance 10 hours ago
How much of this research and review is hands-on and how much of it is—ahem—machine assisted?
[-]
- rendx 8 hours ago
  Are you asking how much was done with pen and paper, and how much of it was done on a computer, i.e. machine assisted? Where do you draw the line? How is "hands-on" in contrast to anything? Is it only "hands-on" when you don't use any tool to assist you?
  I suspect you're inquiring about the use of LLMs, and about that I wonder: Why does it matter? Why are you asking?
  [-]
  - tolerance 6 hours ago
    First thanks for taking my question seriously and not as just a rib and asking a lot of questions in return that I want to consider myself.
    By "hands-on" I'm asking whether the provided insight is the product of human intellection. Experienced, capable and qualified. Or at least an earnest attempt at thinking about something and explaining the discoveries in the ways that thinking was done before ChatGPT. For some reason I find myself using phrases involving the hands (etc. hands-on, handmade, hand-spun) as a metaphor for work done without the use of LLMs.
    I emphasize insight because I feel like the series of work on the Snowden documents by libroot is wanting in that. I expressed as much the last time their writing hit the front page: <https://news.ycombinator.com/item?id=46236672>.
    These are summaries. I don't think that it yields information that can't otherwise be pointed out and made mention of by others; presumably known and reputable. With as high-profile of an event that this is I'd expect someone covering it almost 16 years later to tell us beyond what when judged on the merit of its import amounts to a motivated section of the ‘Snowden disclosures’ Wikipedia entry.
    The discussion that this series invites typically is centered around people's thoughts about the story of the Snowden documents in general, and in this case exchanges about technical aspects like how PDF documents work and can be manipulated in general. The one comment that I feel addresses the actual tension embedded in the article—"Who edited the documents?"—leads to accusations that the documents were tampered with by the media: <https://news.ycombinator.com/item?id=46566372>. I don't think that that's an implausible claim but I find issue with it being made with such confidence by the anonymous source behind the investigations (I'm withholding ironically putting "investigations" in...nevermind).
    If the author actually provided something that advanced to the reader why this information is significant, what to do with or think about it and how they came about discovering the answers to the aforementioned 'why' and ‘what’ and additionally why they’re word ought to matter to us at all, I'd be less inclined to speculate that this is just someone vibe sleuthing their way through documents that on the surface are only significant to the public as the claim "the government is spying on you" is.
    This particular post uncovers some nice information. It's a great find. I'm in no position to investigate whether it was already known. But what are we supposed to learn from it aside from "one of the documents were changed before it was made public". What's significant about the redaction? Is Ryan Gallagher responsible? Or does he know who is. Is he at all obliged to explain this to a presumably anonymous inquirer? Or is it now the duty of the public to expect an explanation as affected by said anonymous inquirer?
    Remember when believing that the government was rife with pedophiles automatically associated you with horn-helmet-wearing insurrectionists?
  - Ylpertnodi 8 hours ago
    [flagged]
    [-]
    - rendx 8 hours ago
      Are you confusing me with the authors, or why would you think I could? And I'm asking 'tolerance' to clarify their question, which means I wouldn't be able to answer it even if I had the knowledge they were after, since I don't understand what they're asking.