14 comments

  • mlissner 4 hours ago
    Cool to see this here. It’s funny because we do so many huge, complex, multiyear projects at Free Law Project, but this is the most viral any of our work has ever gone!

    Anyway, I made X-ray to analyze the millions of documents we have in CourtListener so that we can try to educate people about the issue.

    The analysis was fun. We used S3 batch jobs to analyze millions of documents in a matter of minutes, but we haven’t done the hard part of looking at the results and reporting them out. One day.

    • thangalin 3 hours ago
      https://www.argeliuslabs.com/deep-research-on-pdf-redaction-...

      > Information Leaking from Redaction Marks: Even when content is properly removed, the redaction marks themselves can leak some information if not done carefully. For example, if you have a black box exactly covering a word, the length of that black box gives a clue to the word’s length (and potentially its identity).

      Does X-ray employ glyph spacing attacks and try to exploit font metric leaks?

      • mlissner 2 hours ago
        No, we worked with researchers who developed that kind of system, but didn't broadcast our work b/c the research was too sensitive. Seems the cat is out of the bag now though.

        I think the combination of AI and font-metrics is going to be wild though. You ought to be able to make a system that can figure out likely words based on the unredacted ones and the redaction's size. I haven't seen any redaction system yet that protects against this.

        • thangalin 2 hours ago
          > I haven't seen any redaction system yet that protects against this.

          The linked article suggests widening redacted areas more than needed with some randomization applied to the width. Strikes me that that wouldn't do much except add a few more possible solutions.

          • vlovich123 2 hours ago
            Yeah, the more robust protection is to widen to a constant. But in the general case that could require reflowing the pdf. But honestly single word redactions are really probably useless with cheap AI that can highly accurately fill in the gaps
            • rgmerk 1 hour ago
              Depends what you're trying to hide.

              If the redaction is a person's name, and there's nothing else to give the person's identity away, single word redaction probably works reasonably well, AI or no AI.
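
The padding trade-off discussed in this subthread can be sketched in a few lines. This is only an illustration under loud assumptions: the `WIDTHS` table is hypothetical (real advance widths come from the embedded font), kerning is ignored, and `candidates` is a made-up helper, not part of any real tool.

```python
# Hypothetical advance widths in font units (1/1000 em). Real values would
# come from the document's font metrics; kerning is ignored in this sketch.
WIDTHS = {"a": 556, "b": 556, "e": 556, "m": 833, "n": 556,
          "o": 556, "s": 500, "y": 500}

def text_width(word: str) -> int:
    """Total advance width of a word under the hypothetical table."""
    return sum(WIDTHS[ch] for ch in word)

def candidates(box_width: int, words: list[str], max_padding: int = 0) -> list[str]:
    """Words that fit a box widened by anywhere from 0 to max_padding units."""
    return [w for w in words
            if text_width(w) <= box_width <= text_width(w) + max_padding]

words = ["yes", "no", "maybe", "none"]
print(candidates(1556, words))                    # box drawn exactly over the word
print(candidates(1600, words, max_padding=500))   # box with random extra width
print(candidates(4000, words, max_padding=4000))  # every box forced to one width
```

With an exact box only one word fits; random padding adds a second candidate; a constant width leaves every word that is short enough, so the box reveals only an upper bound on length.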

        • vlovich123 2 hours ago
          I thought glyph spacing attacks are an old idea; like I recall reading about such ideas 10-20 years ago unless I’m misremembering. Can you clarify why it was considered “too sensitive” if the whole point of this effort is to showcase these attacks?
  • embedding-shape 3 hours ago
    I haven't gone through more than just 10% of the files released today, but noticed that at least EFTA00037069.pdf for example has a `/Prev` pointer, meaning the previous revision of the file is available inside of the PDF itself. In this case, the difference is minor (stuff moved around), but I'm guessing if it's in one file, it could be more. You can run `qpdf --show-object=trailer EFTA00037069.pdf` on a PDF file to see for yourself if it's there.
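
For anyone without qpdf at hand, a rough byte-level check for the same hint might look like the sketch below. It assumes that scanning the raw file for `/Prev` and `%%EOF` is good enough for a first look; `incremental_update_hints` is a hypothetical helper, and the sample bytes are a synthetic stand-in, not a valid PDF.

```python
import re

def incremental_update_hints(pdf_bytes: bytes) -> dict:
    """Collect byte-level hints that a PDF was saved incrementally."""
    # Every save that appends to the file normally ends with its own %%EOF.
    eof_markers = len(re.findall(rb"%%EOF", pdf_bytes))
    # /Prev holds the byte offset of the previous cross-reference section.
    prev_offsets = [int(n) for n in re.findall(rb"/Prev\s+(\d+)", pdf_bytes)]
    return {"eof_markers": eof_markers, "prev_offsets": prev_offsets}

# Synthetic stand-in for a file with one appended revision (not a valid PDF).
sample = (
    b"%PDF-1.4\n"
    b"trailer\n<< /Size 5 /Root 1 0 R >>\nstartxref\n100\n%%EOF\n"
    b"trailer\n<< /Size 7 /Root 1 0 R /Prev 100 >>\nstartxref\n400\n%%EOF\n"
)
print(incremental_update_hints(sample))
```

For a real file you would pass in `open(path, "rb").read()`. Linearized ("fast web view") PDFs also carry a `/Prev` entry, so a hit is a hint that an earlier revision may be present, not proof.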

    I'm almost fully convinced that someone did this badly intentionally, together with the bad redactions, as surely people tasked with redacting a bunch of files receive some instructions on what to do/not to do?

    • xhevahir 1 hour ago
      > as surely people tasked with redacting a bunch of files receive some instructions on what to do/not to do?

      You've phrased this as a question; I gather that you know better than to assume a modicum of competence from these people.

    • throwawaysleep 29 minutes ago
      All the reporting I have read suggests that they are roping anyone and everyone they can into doing redactions. So I suspect many simply lack the experience to do it well.
      • embedding-shape 22 minutes ago
        Ok, so say someone says "We're overloaded, we need more people" so someone else says "Ok, department Q, R and T changes priority to doing redaction" then at least one person somewhere in this chain has to at least consider that every person from Q, R and T must go through at least a 3 slide powerpoint or whatever saying what's happening, this is what to do, this is what to not do, right?
    • dcollect 1 hour ago
      went through them all nothing of note misdirection speculation fuel is all this is

      JaneDoe2 is redacted 150 times (so far)

      for example

  • jmward01 2 hours ago
    Hmmm.. The more I think about this the more any font kerning is likely a major leak for redaction. Even if the boxes have randomness applied to them, the words around a blacked out area have exact positioning that constrains the text within so that only certain letter/space combinations could fit between them. With a little knowledge of the rendering algorithm and some educated guessing about the text, a brute-force search may be able to do a very credible job of discovering the actual text. This isn't my field. Anyone out there that has actually worked on this problem?
    • worewood 1 hour ago
      There was a recent vulnerability, where researchers were able to extract information from an encrypted chat session from an LLM, by analyzing packet size/timings of the underlying SSL connection. A classic side-channel attack. Seems possible to draw a parallel between the two.
    • mlissner 2 hours ago
      Really depends on the length and predictability of the redaction, but yes. If it's short and contextually it's only likely to be either "yes" or "no", you've got it. If it's longer and could contain an unknown person's name along with some other words, well, that's harder.
      • jmward01 26 minutes ago
        I feel like this creates a hash value and the real question is how unique of a value does it represent and how easy it is to narrow it down given throwing a dictionary at it. Similarly, unknown names could likely be teased out like a one-time pad. If they appear in multiple sentences then their randomness quickly repeats and becomes something that potentially could be isolated from the rest of the words around them. This would probably be a fun problem for a cryptography class to work on.
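
The "width as a hash" idea can be made concrete by bucketing a dictionary by rendered width and looking at how many words collide. This is a toy sketch: the `WIDTHS` table is hypothetical, kerning is ignored, and the word list is far smaller than any real dictionary.

```python
from collections import defaultdict

# Hypothetical advance widths in font units (1/1000 em), not taken from a
# real font file; kerning is ignored.
WIDTHS = {"a": 556, "b": 556, "e": 556, "m": 833, "n": 556,
          "o": 556, "s": 500, "y": 500}

def text_width(word: str) -> int:
    """Total advance width of a word under the hypothetical table."""
    return sum(WIDTHS[ch] for ch in word)

def bucket_by_width(words: list[str]) -> dict[int, list[str]]:
    """Group words by width; a bucket with one word is a unique 'hash'."""
    buckets = defaultdict(list)
    for word in words:
        buckets[text_width(word)].append(word)
    return dict(buckets)

buckets = bucket_by_width(["yes", "no", "on", "none", "neon", "maybe"])
for width, group in sorted(buckets.items()):
    print(width, group)
```

Buckets holding a single word are the ones a width alone would give away; anagrams and same-width letter swaps collide, which is where context or a language model would have to do the rest.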
  • blitz_skull 1 hour ago
    Explain like I’m stupid: what is the most gracious interpretation of redaction when releasing files like this?

    Why should anyone involved retain any anonymity?

    I’m asking in good faith because naively it seems like this should not even exist. All of it should be exposed.

    EDIT: I did not think about the innocent folks that might be caught in the crossfire. That checks out. Thanks everyone!

    • OsrsNeedsf2P 1 hour ago
      Iirc WikiLeaks took the position of any information that would directly lead to the bodily harm of an individual (or something to that effect). The rationale being, "Yes, group A did something horrible that warrants investigation, but if we publish their GPS coordinates they will be blown to smithereens"
    • krapp 1 hour ago
      Protecting the identity of victims, eyewitnesses or informants.
    • empath75 1 hour ago
      The files of a high profile and long running investigation are going to be full of false leads, hoaxes and other bullshit. The reason they don’t just always release the files after closing cases is that there genuinely are going to be innocent people caught in the crossfire who have privacy rights.

      This case is so important and such a clusterfuck that the files need to be opened anyway.

  • brotchie 3 hours ago
    You'd think the go-to workflow for releasing redacted PDFs would be to draw black rectangles and then rasterize to image-only PDFs :shrug:
    • selinkocalar 3 hours ago
      As someone who's built an entire business on "anti-screenshots" this is brilliant.

      PDF redaction fails are everywhere and it's usually because people don't understand that covering text with a black box doesn't actually remove the underlying data.

      I see this constantly in compliance. People think they're protecting sensitive info but the original text is still there in the PDF structure.

    • shbooms 3 hours ago
      often times you will have requirements that the documents you release be digitally searchable and so in these cases, this would not be an option
      • pottertheotter 1 hour ago
        This made me think of something I came across recently that’s almost the opposite problem of requiring PDFs to be searchable. A local government would publish PDFs where the text is clearly readable on screen, but the selectable text layer is intentionally scrambled, so copy/paste or search returns garbage. It's a very hostile thing to do, especially with public data!
      • 8note 3 hours ago
        run some ocr on them after to recreate the text layer?
  • unstatusthequo 9 minutes ago
    It’s a bit amusing seeing ediscovery principles go mainstream.
  • unfocused 4 hours ago
    Adobe Pro, when used properly, will redact anything in a PDF permanently.

    Whoever did these "bad" redactions doesn't even know how to use a PDF Editor.

    We have paralegals and lawyers "mark for redaction", then review the documents, then "apply redactions". It's literally been done by thousands of lawyers/paralegals for decades. This is just someone not following the process and procedure, and making mistakes. It's actually quite amateurish. You should never, ever screw up redactions if you follow the proper process. Good on the X-ray project on trying to find errors.

    I just want to add, applying black highlights on top of text is in fact, the "old" way of redaction, as it was common to do this, and then simply print the paper with the black bars, and send the paper as the final product.

    Whoever did it is probably old, and may have done it thinking they were going to print it on paper afterwards!! Just guessing as to why someone would do this.

    • tgsovlerkhgsel 4 hours ago
      Or they may not understand how PDF works and think that it's the same as paper.

      Especially with the "draw a black box over it" method, the text also stops being trivially mouse-selectable (even if CTRL+A might still work).

      Another possibility is, of course, that whoever was responsible for this knew exactly what they were doing, but this way they can claim an honest mistake rather than intentionally leaking the data.

      • aidos 3 hours ago
        A while back I did a little work with a company that were meant to help us improve our security posture. I terminated the contract after they sent me documents in which they’d redacted their own AWS keys using this method.
      • zahlman 4 hours ago
        > Or they may not understand how PDF works and think that it's the same as paper.

        Yes; that's presumably included in being "amateurish" and "not following proper process".

    • selectodude 2 hours ago
      Any attorney or law enforcement that works for the US Federal Government receives very, very comprehensive instructions on how to redact information on basically the first day of training. There is absolutely zero doubt among any of my DOGE'd friends that this was 100 percent on purpose malicious compliance.
  • shrubble 2 hours ago
    Shockingly, you can see redaction info from within your browser's PDF viewer. I am using Brave on Linux, and went here:

    https://www.justice.gov/multimedia/Court%20Records/Matter%20...

    As a test, select with your mouse the entire first line of paragraph number 90, and then paste it into a text editor or a shell. The unredacted text appears!

    • ktpsns 2 hours ago
      This is exactly the type of bad redactions which the X-ray software will also find.
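
The geometric idea behind spotting this kind of bad redaction can be sketched roughly as follows. This is not X-ray's actual code: it assumes the text spans and filled rectangles on a page have already been extracted by some PDF library as `(x0, y0, x1, y1)` boxes, and the names `Box`, `overlaps` and `text_under_rectangles`, along with the sample data, are made up for illustration.

```python
from typing import NamedTuple

class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float

def overlaps(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share some area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

def text_under_rectangles(spans, rects):
    """Return the text of every span that intersects a drawn rectangle."""
    return [text for box, text in spans
            if any(overlaps(box, rect) for rect in rects)]

# Made-up page content: three text spans and one black rectangle.
spans = [
    (Box(72, 100, 200, 112), "The witness,"),
    (Box(205, 100, 290, 112), "Jane Example,"),
    (Box(295, 100, 400, 112), "testified."),
]
rects = [Box(204, 98, 291, 114)]
print(text_under_rectangles(spans, rects))
```

Any text returned here is still present in the file's text layer even though a viewer paints a box over it, which is why select-and-paste recovers it.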
  • seanw444 4 hours ago
    The context for OP posting this is that many of the recently-released Epstein documents were PDFs "redacted" by being drawn on top of.
    • agumonkey 4 hours ago
      I wasn't sure of this, even though sometimes you'd see remains of the original characters near rectangles edges.. does this mean the leaked documents have been de-redacted ?
      • k1t 4 hours ago
        • agumonkey 4 hours ago
          oh that's a beautiful sight

          hopefully this is the straw that breaks the camel's back

          • XorNot 4 hours ago
            Why would that be the case? The government isn't redacting "yes we contacted aliens" they're redacting information about military capabilities that might be of use to adversaries.
            • agumonkey 3 hours ago
              sorry the title mentioned epstein files, so i was hoping incriminating facts that would accelerate trump's fall
              • jibal 3 hours ago
                No reason to be sorry ... you are right and the other person seems quite confused about the context.
      • kstrauser 4 hours ago
        • agumonkey 4 hours ago
          yeah i expected every political team, even the low level ones, to be fully aware of naive pdf "edition"... alas, incompetence often does that
          • arthurcolle 4 hours ago
            Checks and balances for a more technological era.
            • airstrike 4 hours ago
              Survival of the leetest
          • zahlman 4 hours ago
            I'm actually surprised not to have yet heard widespread conspiracy theorization that this is deliberate for some inscrutable reason or other.
            • kstrauser 4 hours ago
              Something something "chess, not checkers, this proves he has them on the run!"
    • arthurcolle 4 hours ago
      Also good for UFO/UAP/"anomalous phenomena" documents and remote viewing PDFs for what it's worth :)
    • formerly_proven 4 hours ago
      Is there a good free tool to properly redact PDFs? My workflow is to place black annotation rectangles on top and then print as PDF with "force rasterization" on. The resulting PDF files then just consist of pages with one image each. But this tends to be really suboptimal, because it's usually a grayscale or color rasterization, so file sizes are very large vs. monochrome PDFs with CCITT G3/G4 compression (which is absolutely what you want for text content, excellent compression and lossless). Post-processing PDFs to convert them to CCITT is rather annoying and I only know of CLI ways.
  • 5ak12agff 2 hours ago
    Given that no U.S. or Israeli citizen apart from Epstein and Maxwell has experienced severe repercussions and Andrew Windsor is the perfect fall guy, there is the possibility that nothing will be revealed from these uncovered redactions.

    The releases haven't yielded anything so far. For all we know, Epstein used other methods of communications for the really sensitive stuff. This would not be a surprise, since the whole Maxwell family was deep into tech (Magellan, Chiliad) and Ehud Barak was the head of Israeli military intelligence in the 1980s.

    The story is going to be closed in a bipartisan manner except that it might be used to remove some unwanted politicians. The New York Times has already released an article that "explains" Epstein's wealth which names all figures that appear in "conspiracy theories" in an innocent way. Basically, they claim that Epstein could just steal from billionaires like Wexner and the billionaires would roll over and do nothing.

    That is the official confirmation that all intelligence angles will be squashed in a bipartisan manner. For all we know, the "incompetence" in the redactions may be a way of saying: "See, we have nothing to hide."

  • gigatexal 3 hours ago
    Hilarious that DOJ didn’t flatten the layers so you can unredact stuff. What a clown show of incompetent idiots. Or… a skillful one over on the powers that be internally from someone who knew better but knew that they wouldn’t know … and did this to help us all
  • dcollect 2 hours ago
    lol thanks bros

    text=about them to damage their credibility when they tried to go public with their stories of being text=Epstein also threatened harm to victims and helped release damaging stories =attorneys' fees and case costs in litigation related to this conduct.

    =Defendants also attempted to conceal their criminal sex trafficking and abuse

    text=$327,497.48 and $6,487.04 in New York City text=trafficking and abuse conduct. text=destroy evidence relevant to ongoing court proceedings involving Defendants' criminal sex text=Epstein also instructed one or more Epstein Enterprise participant-witnesses to text=trafficked and sexually abused. text=conduct by paying large sums of money to participant-witnesses, including by paying for their

  • IceHegel 4 hours ago
    Given recent high profile redaction events, I think one simple use of AI would be to have it redact documents according to an objective standard.

    That should in theory prevent overly redacted documents for political purposes.

    An approach that could be rolled out today would be redacting with human review, but showing what % of redactions the AI would have done, and also showing the prompt given to the AI to perform redactions.

    • mmazing 4 hours ago
      Honestly, it doesn't take any inference or need for AI, there's simply data in the documents that can be extracted.
      • bogtog 4 hours ago
        I don't think the commenter above is saying that an AI should necessarily apply the redaction. Rather, an AI can serve as an objective-ish way of determining what should be redacted. This seems somewhat analogous to how (non-AI) models can be used to evaluate how gerrymandered a map is