Password-protection checker for PDF, DOC and DOCX files

Project: Rails (monolith)
Objective: Verify whether uploaded files are password-protected and, if so, prompt users to upload unprotected versions.

At first, I thought plain JavaScript would be enough for the file check. Turns out, not really:

Users could bypass the check if they wanted to. (Not a big deal, but still.) Either way, we would need to replicate the same validation on the server side.
The most reliable approach would be to use a system library to try to open the file and see whether it asks for a password; then we would know for sure. And we can only do that at the system level, on the server side.
But where is the fun in that? Coding is learning too, so I started coding the solution in JavaScript.

PDF files turned out to be the easiest to check. You can read the file as binary data and search for encryption-related byte patterns:

  async #isPdfPasswordProtected(file) {
    return new Promise((resolve, reject) => {
      const reader = new FileReader();

      reader.onload = function (e) {
        const data = new Uint8Array(e.target.result);
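        // Build a binary string in chunks; passing the whole array to
        // String.fromCharCode at once can exceed the engine's argument limit.
        let pdfString = '';
        const chunkSize = 8192;

        for (let i = 0; i < data.length; i += chunkSize) {
          pdfString += String.fromCharCode.apply(null, data.subarray(i, i + chunkSize));
        }

        // An encrypted PDF references its encryption dictionary, e.g. "/Encrypt 12 0 R".
        const isEncrypted = /\/Encrypt\s*\d+\s+\d+\s+R/.test(pdfString);
        resolve(isEncrypted);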
      };
      reader.onerror = function (e) {
        reject(e);
      };

      reader.readAsArrayBuffer(file);
    });
  }

That worked for read-protected PDF files but not for edit-protected ones (which wasn't a requirement for this project anyway).
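
To try the same pattern check outside the browser (in a Node script or a unit test, say), it can be written against raw bytes instead of a File. This is only a sketch: hasPdfEncryptReference is a hypothetical name, and like the regex above it only finds an /Encrypt entry written as a plain indirect reference, so an inline or compressed encryption dictionary would be missed.

```javascript
// Hypothetical helper: the same regex check as above, applied to a Uint8Array.
function hasPdfEncryptReference(bytes) {
  // 'latin1' is a single-byte encoding, so every byte becomes exactly one
  // character and the ASCII parts of the PDF stay searchable.
  const text = new TextDecoder('latin1').decode(bytes);
  return /\/Encrypt\s*\d+\s+\d+\s+R/.test(text);
}
```

In the browser the bytes could come from new Uint8Array(await file.arrayBuffer()).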

DOCX files are basically ZIP archives with XML files inside. So the idea is: open the DOCX as a ZIP and check whether word/document.xml can be read. If the archive can't be opened, or that file is missing or unreadable, the document is probably password protected.

This worked well, but I needed to add JSZip as a dependency to handle the ZIP operations in JavaScript.

DOCX checking function:

  async #checkDocxEncryption(arrayBuffer) {
    try {
      const zip = await JSZip.loadAsync(arrayBuffer);

      // Explicit encryption markers, if the archive exposes them.
      const hasEncryptionInfo = zip.file('EncryptionInfo') !== null;
      const hasEncryptedPackage = zip.file('EncryptedPackage') !== null;
      if (hasEncryptionInfo || hasEncryptedPackage) {
        return true;
      }

      // A readable, non-empty document.xml means the file is not encrypted;
      // a missing or empty one means it may be.
      const documentXml = await zip.file('word/document.xml')?.async('string');
      return !documentXml;
    } catch (err) {
      // JSZip could not open the archive or read the entry: likely encrypted.
      console.error('Error reading .docx file, likely encrypted:', err);
      return true;
    }
  }
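
A dependency-free complement to the function above is to look at the first bytes of the file. As far as the Office encryption format goes, a password-protected DOCX is no longer a ZIP at all: it is wrapped in an OLE compound file that holds the EncryptionInfo and EncryptedPackage streams, which would explain why the ZIP parser fails on it. The sketch below only sniffs the container signature; sniffContainer and startsWith are hypothetical names, and a 'cfb' result for a .docx upload could also just be a misnamed .doc.

```javascript
// Leading bytes of the two container formats.
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"
const CFB_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// True when `bytes` begins with every byte of `signature`.
function startsWith(bytes, signature) {
  return bytes.length >= signature.length &&
    signature.every((value, index) => bytes[index] === value);
}

// Hypothetical helper: returns 'zip', 'cfb' or 'unknown'.
function sniffContainer(arrayBuffer) {
  const bytes = new Uint8Array(arrayBuffer);
  if (startsWith(bytes, ZIP_SIGNATURE)) return 'zip';
  if (startsWith(bytes, CFB_SIGNATURE)) return 'cfb';
  return 'unknown';
}
```

A .docx that sniffs as 'zip' can go on to the JSZip check; one that sniffs as 'cfb' is a strong hint that the document is encrypted.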

DOC files were a different story. The older DOC format uses Microsoft's proprietary compound file binary format, and parsing that in JavaScript alone was painful.
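
To give an idea of what is involved: once the compound file has been parsed, the flag itself is easy to read. The sketch below assumes the bytes of the "WordDocument" stream have already been extracted by some OLE/CFB parser (that parsing is the painful part and is not shown). It relies on the header layout described in Microsoft's MS-DOC specification, where the stream starts with the magic number 0xA5EC and the fEncrypted bit sits in the flag field at offset 0x0A. isWordStreamEncrypted is a hypothetical name.

```javascript
// Hypothetical helper: expects the bytes of the "WordDocument" stream,
// already extracted from the compound file by an OLE/CFB parser.
function isWordStreamEncrypted(streamBytes) {
  // Too short to contain the part of the header we need.
  if (streamBytes.length < 12) return false;

  const view = new DataView(
    streamBytes.buffer, streamBytes.byteOffset, streamBytes.byteLength);

  // wIdent: 0xA5EC (little-endian) identifies a Word binary file.
  if (view.getUint16(0, true) !== 0xa5ec) return false;

  // 16-bit flag field at offset 0x0A; bit 8 (mask 0x0100) is fEncrypted.
  const flags = view.getUint16(10, true);
  return (flags & 0x0100) !== 0;
}
```

Returning false for an unrecognised header is a choice made for this sketch; a stricter checker might treat it as an invalid file instead.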

I ended up finding a Ruby library that handles DOC password detection, which made me move all the file checks to the backend. Simpler to maintain everything in one place anyway.

Final backend solution:

Gist with the final code

I extracted and reused some parts of the docx gem and the msworddoc-extractor gem.

Requirements:

Install ImageMagick on the Linux server
Add the image_processing gem for PDF checking
Add the rubyzip gem for DOCX checking
Add the ruby-ole gem for DOC checking

All things considered, we could instead install LibreOffice on the server and try to open the file using the LibreOffice CLI. It's a nice alternative, perhaps a better one. It's a big package to have on the server, though, and I don't know what impact it would have on overall system performance or on the processing time of each file.

Learned a lot digging into file formats on this one.