9 Tips for Creating Searchable PDF Documents for Review

The searchable PDF (portable document format) is becoming increasingly relevant to legal professionals in discovery, document review, and related litigation matters. One of the main drivers of this trend, in addition to the format’s popularity in corporate environments, is that courts in many jurisdictions require pleadings and motions to be filed in PDF. Fortunately, many low-cost options allow firms and organizations to create PDFs, and newer versions of Adobe Acrobat support Bates numbering and legal redaction. Below are nine best practices that will help you maximize the searchability and benefits of using PDFs in discovery.

1. Choose the ‘Text-Under-Image’ Option: When scanning a document, you may be presented with different options for types of PDF files. If available, you will usually want to choose the option that applies optical character recognition (OCR) to make the document text searchable. This can be implemented in different ways depending on your specific hardware and software, including a “make searchable (apply OCR)” option, or “text-under-image” or “searchable PDF” file type options. With any of these, your scanned document will be text searchable within the Acrobat viewer and many other programs designed to search PDF files. The other type of PDF you could choose is called an “image-only PDF”, which is not text searchable. When viewing a PDF file, you can tell if it is searchable by looking for the ‘select tool’ on the top bar in Acrobat Reader; if the tool is available and lets you select text on the page, the file is text searchable.

2. Get the Resolution Right: When scanning images to PDF for litigation purposes, 300 dpi (dots per inch) is a safe option. Scanning at a lower resolution (e.g. 200 dpi) can work well and will produce a smaller file, but legibility can suffer with smaller fonts (e.g. 6 pt. in financial documents). OCR quality can also suffer at lower scan resolutions. The trade-off is that higher scan resolutions result in larger file sizes. Scanning above 300 dpi usually does not appreciably increase the readability of a document or its OCR quality.
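The effect of resolution on small print can be made concrete with a little arithmetic. The sketch below converts a font size in points to an approximate pixel height at a given scan resolution, using the standard 72 points per inch. The function name is illustrative only, and the result is a rough indication of how much detail small text receives, not a prediction of OCR accuracy.

```python
POINTS_PER_INCH = 72  # standard typographic points per inch


def font_height_pixels(point_size, dpi):
    """Approximate pixel height of a font's nominal size at a scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


# Compare how many pixels a 6 pt font gets at common scan resolutions.
for dpi in (150, 200, 300):
    height = font_height_pixels(6, dpi)
    print(f"6 pt text at {dpi} dpi: about {height:.1f} px tall")
```

Going from 200 dpi to 300 dpi gives each character half again as many pixels in each direction, which is why small financial print tends to benefit from the higher setting.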

3. Scan to B&W, Grayscale or Color: For litigation review purposes, ‘Black & White’ is often a good option, particularly with good quality originals, and creates a much smaller file than a grayscale or color scan. Color or grayscale may be required for photos (which do not display well with a ‘black and white’ setting). For some documents, a color scan may be critical to understanding the document, such as some charts (e.g. in PowerPoint presentations) or CAD (computer aided design) documents. Color and grayscale scans will be larger than B&W scans.
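To see why resolution and color mode drive file size, it helps to estimate the raw image data for one page before any compression. The sketch below assumes 1 bit per pixel for black and white, 8 for grayscale, and 24 for color, which are common values but not universal; the function and dictionary names are made up for illustration. Actual PDF sizes will be much smaller and depend heavily on the compression your scanner or software applies.

```python
# Assumed bit depths for each scan mode (typical values, not universal).
BITS_PER_PIXEL = {"bw": 1, "grayscale": 8, "color": 24}


def raw_page_bytes(width_in, height_in, dpi, mode):
    """Estimate uncompressed image bytes for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * BITS_PER_PIXEL[mode] // 8


# US Letter page (8.5 x 11 inches) at two resolutions and three modes.
for dpi in (200, 300):
    for mode in ("bw", "grayscale", "color"):
        size_mb = raw_page_bytes(8.5, 11, dpi, mode) / 1_000_000
        print(f"{dpi} dpi, {mode}: about {size_mb:.1f} MB before compression")
```

The raw data grows with the square of the resolution and linearly with the bit depth, so under these assumptions a 300 dpi color page carries 24 times the raw data of a 300 dpi black and white page.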

4. Watch the Other Settings: Scanners will often have a number of other settings that can help improve scan quality and OCR. These include ‘deskew’ (rotates any page that is not square with the sides of the scanner bed, to make the PDF page align vertically), ‘background removal’ (whitens nearly white areas of grayscale and color input), and ‘edge shadow removal’ (removes dark streaks that occur at the edges of scanned pages, where the scanner light is shadowed by the paper edge). ‘Deskew’ will help with OCR accuracy, while ‘background removal’ and ‘edge shadow removal’ can improve readability, but can sometimes impair OCR accuracy. For important documents, it’s best to run some tests.

5. Get a Quality OCR Program: Not all OCR is created equal. The quality of optical character recognition varies substantially with the quality of the program and the settings chosen when running it. Programs often have a ‘fast’ and a ‘slow’ mode, with the slow mode usually delivering better quality OCR. Some programs will auto-rotate pages when necessary; others will not, and will introduce OCR errors on rotated pages as a result.

6. Pay Special Attention to the Numbers: One secret of OCR programs is that they routinely rely on dictionaries to help recognize text. This works pretty well with words (if they are in the dictionary), but doesn’t help with numbers or other arbitrary character strings that are not in a dictionary. Expect to see lower quality OCR in financial reports and other number-intensive documents.
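One way to spot-check number-heavy OCR output is to flag tokens that mix digits with letters that look like digits. The sketch below is a simple heuristic under stated assumptions: the set of confusable letters is illustrative rather than exhaustive, the function name is hypothetical, and legitimate identifiers that mix letters and digits will also be flagged, so treat the results as candidates for manual review.

```python
# Letters that are often confused with digits in OCR output
# (an illustrative set chosen for this sketch, not an exhaustive list).
CONFUSABLE_LETTERS = set("OolISBZ")


def suspicious_numeric_tokens(text):
    """Return tokens that contain both digits and digit-like letters."""
    flagged = []
    for token in text.split():
        core = token.strip(".,;:()$%")  # ignore surrounding punctuation
        has_digit = any(ch.isdigit() for ch in core)
        has_confusable = any(ch in CONFUSABLE_LETTERS for ch in core)
        if has_digit and has_confusable:
            flagged.append(token)
    return flagged


sample = "Total due: $1O,5OO.00 on invoice 4471"
print(suspicious_numeric_tokens(sample))
```

Here the amount contains the letter O where zeros were expected, so it is flagged, while the clean invoice number is not.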

7. Make Sure Your Litigation Support Software Really Supports PDF: Many legacy litigation software systems were designed around files saved as TIFFs (tagged image file format), an older type of file format that does not support integrated text as part of the file as PDF does. These older software systems usually have added some support for PDFs, but often the integration with PDF is incomplete and some features are not supported with PDF files.

8. Do Redactions the Right Way: Redactions can be tricky in PDF, and this has been a primary reason why TIFF has survived as a popular format in legal matters. A trap for the unwary is that it is possible in a PDF file to redact text on the image of a document and still have the redacted text be searchable! In a text-under-image PDF file, the redaction must be done on both the text and the image. This problem has been addressed in recent versions of Acrobat (Acrobat Professional 8 or 9), and this program can be used for PDF redactions. Third-party redaction tools are available as well. Many practitioners play it safe with redacted documents by printing, marking out by hand, and rescanning. This method is manual but fool-proof, and works well if the number of documents to be redacted is limited.
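The redaction trap described above can be checked for after the fact. The sketch below assumes you have already extracted the text layer of the redacted PDF with whatever tool you use (the extraction step is not shown, and the function names are hypothetical). It simply reports which redacted terms still appear in that text. An empty result does not prove a redaction is safe, since OCR variants of a term or document metadata could still leak it.

```python
import re


def _normalize(text):
    """Collapse whitespace and lowercase so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def leaked_terms(extracted_text, redacted_terms):
    """Return the redacted terms that still appear in the extracted text layer."""
    haystack = _normalize(extracted_text)
    return [term for term in redacted_terms if _normalize(term) in haystack]


# Text as it might come out of a PDF whose image was redacted but whose
# hidden text layer was left untouched (made-up sample data).
text_layer = "Payment to John  Q. Smith\nSSN 123-45-6789"
print(leaked_terms(text_layer, ["john q. smith", "555-0100"]))
```

If a name you blacked out on the page image still shows up here, the text layer was not redacted and the file should not be produced as is.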

9. Be Specific in Discovery Requests: Litigators are increasingly asking that documents produced in response to discovery requests be provided in electronic form as PDFs. If you do this, be specific as to the matters above. In particular, be sure to specify that the scan resolution be 300 dpi and that OCR be applied. You may also wish to ask what OCR software is used and what settings will be applied. To be non-specific is to invite an adversary to return documents scanned at 150 dpi without OCR, which may be unsearchable, illegible and unintelligible!

Opposing Perspectives in Document Review

Plaintiffs and defendants face unique challenges during discovery. A central difference is asynchronous data; plaintiffs have had fewer data collection concerns while defendants have had to manage complex collection and production of vast ESI stores. On the other hand, defendants often have greater access to resources, while plaintiffs pay more attention to prioritizing expenses and making every dollar count towards a favorable outcome.

While both sides share a duty to competently review relevant case documents, goals and methodologies can differ distinctly. This webinar will cover how these different approaches can benefit from specialized document review strategies and technology.

Key Points

  • Is Discovery Different for Plaintiffs and Defendants?
  • Asynchronous eDiscovery
  • Differing Resources
  • Contingency Arrangements
  • Examining Plaintiff and Defendant Discovery Concerns
  • Top Takeaways for Plaintiffs and Defendants

About the Speaker

Gene Albert is the CEO of Lexbe, and a frequent speaker and writer on litigation technology and eDiscovery topics. He is on the Planning Committee of the Texas State Bar eDiscovery Program. Gene has his JD from Southern Methodist University and his MBA from the University of Texas at Austin.


Embracing the Power of the Cloud in eDiscovery

Cloud infrastructures provide an unrivalled opportunity to control eDiscovery costs without sacrificing quality, functionality, or security. Firms without large litigation support departments can instantly increase their discovery and review capacities with on demand/SaaS eDiscovery – often allowing them to take on cases they were previously unable to.

When firms make the decision between bringing litigation support and eDiscovery functions in-house or looking for scalable external solutions, there are several questions whose answers weigh heavily on the decision. How big is the firm? Do caseloads support the newly built capacity? Is there capital available to cover the high fixed costs and overhead associated with establishing or expanding an internal support department? Smaller firms consider these questions and often find themselves caught between opportunities to grow and take on larger cases and constraints imposed by the costs associated with establishing the necessary internal infrastructures.

Cloud-based eDiscovery technologies provide another option: firms can immediately access massive eDiscovery capacities, expert litigation support professionals, and advanced review capabilities, and pay only for what they actually use. Cloud solutions like the Lexbe eDiscovery Platform can support document-intensive litigation from collection through to trial. All you need to use most web-based/cloud eDiscovery tools is a browser and an internet connection.

The Cloud is a term that has developed to describe scaled, non-local storage of data. That is, instead of saving a file to a computer’s local hard drive, one can save that file to a centrally managed and expertly maintained hard drive that lives in a highly secure and professionally staffed server facility. The connection between the local computer and files in the cloud is the internet. One major benefit of cloud infrastructure for legal professionals is accessibility. Because your files are stored in the cloud, litigation teams can access them from any internet-capable device. Storing files in the cloud means that being separated from a computer or phone doesn’t really separate you from your data.

In addition to keeping ESI within reach, the cloud also keeps your data out of the hands of unauthorized persons. Data centers specialize in creating some of the most secure environments in the world, with a wide range of measures employed to ensure redundant storage and encrypted access. The largest banks in the world rely on the cloud, whose security features are practically impossible to replicate on a local computer. When you hear criticisms of cloud security, it is important to remember that “cloud” is a general, descriptive term and that not all cloud providers offer the same security protocols. It is critical when considering cloud software to ask who the infrastructure partner is to ensure credibility. For instance, Lexbe eDiscovery Platform and services run on AWS SOC III Servers, which were named the sole “leader” in service provider security in a recent Forrester Research Report.

In an eDiscovery context, firms of all sizes have taken advantage of the accessibility and security benefits offered by the cloud. For small and mid-sized firms, these benefits are much more stark and also include substantial cost reductions and efficiencies. It is simply not financially expedient or necessary for smaller firms to pay for expensive in-house installation of software and servers, in addition to hiring the litigation support staff needed to manage them locally. There are better options available.

Choosing a Production Format for Your Case

There are a variety of acceptable production formats, each with its own benefits and drawbacks. To determine the best fit for your case, look down the road and consider the scope, goals, and methodology of your review.

An ‘electronic search’ approach to discovery requires that all documents be converted to an electronically searchable form and that a method of searching across all files is available. For electronic documents delivered in native file format, search is usually possible in some form or another. This is particularly true for standard Microsoft Office documents. Email presents more difficulties, as email attachments may need to be deconstructed from the electronic file holding the email to be searched. Paper-based documents must be scanned and OCRed to make them searchable as electronic files. The OCR process inevitably introduces OCR errors, which diminishes the effectiveness of the electronic search, as compared with the search of native files or electronic documents based on native files.
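As a rough illustration of what deconstructing an email involves, the sketch below uses Python’s standard `email` package to separate a message body from its attachments so that each can be searched on its own. The sample message is built in code purely for demonstration, and `split_email` is a hypothetical helper name. Real productions usually involve container formats such as PST or MSG, which need other tools and are not covered here.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_email(raw_bytes):
    """Return (body_text, {attachment_filename: content}) for one raw message."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    attachments = {}
    for part in msg.iter_attachments():
        attachments[part.get_filename()] = part.get_content()
    return body, attachments


# Build a small sample message with one text attachment (made-up content).
sample = EmailMessage()
sample["Subject"] = "Quarterly numbers"
sample.set_content("See attached.")
sample.add_attachment("Q3 revenue figures", filename="notes.txt")

body, attachments = split_email(sample.as_bytes())
print(body.strip())
print(sorted(attachments))
```

Once the attachment has been pulled out of the message, its text can be indexed and searched like any other document in the collection.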

The ‘electronic search’ approach also requires that all documents are addressable as a collection from a single search query. Litigation document repositories may be established to make all documents accessible and searchable, often between multiple parties in different locations. These systems may be comprehensive and expensive. Alternatively, a law firm may make documents searchable from a file server on its local area network, or run LAN-based case management software, which may allow for indexing and searching of litigation files. For a very small case, all documents might be stored on a single CD or DVD, or kept on a portable hard drive, and searched from the Windows operating system.
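The idea of making a whole collection addressable from a single query can be sketched with a small inverted index, which maps each word to the documents that contain it. This is a minimal illustration with made-up document IDs and text; it is not how any particular litigation repository is implemented, and it ignores stemming, phrases, and ranking.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical three-document collection.
docs = {
    "DOC001": "Merger agreement draft between the parties",
    "DOC002": "Email re: merger timeline and board approval",
    "DOC003": "Lease agreement for the Austin office",
}
index = build_index(docs)
print(sorted(search(index, "merger agreement")))
```

Because every document is indexed once up front, a single query runs against the whole collection regardless of where the underlying files are stored.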

Attorneys are now taking several approaches to e-Discovery when searchability or metadata are important. Each approach has its own advantages and disadvantages.

TIFF

A TIFF file is a raster-based image most commonly used in the transmission of faxed pages. Many litigation document management programs were developed using TIFFs as a key part of their program architecture. TIFF files are images and usually do not store computer readable text within the file. Instead, litigation document management systems associate text from a separate text file as part of what is known in the litigation support industry as a ‘load file’.
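To illustrate the role a load file plays, the sketch below reads a simplified, hypothetical load file that ties each Bates number to an image file and a separate text file. The column names and CSV layout are invented for this example; real load file formats vary by litigation platform and are generally more involved.

```python
import csv
import io

# Hypothetical, simplified load file: one row per page image.
SAMPLE_LOAD_FILE = """bates_begin,image_path,text_path
ABC0000001,IMAGES/ABC0000001.tif,TEXT/ABC0000001.txt
ABC0000002,IMAGES/ABC0000002.tif,TEXT/ABC0000002.txt
"""


def read_load_file(handle):
    """Map each Bates number to its (image path, text path) pair."""
    return {
        row["bates_begin"]: (row["image_path"], row["text_path"])
        for row in csv.DictReader(handle)
    }


records = read_load_file(io.StringIO(SAMPLE_LOAD_FILE))
image_path, text_path = records["ABC0000002"]
print(image_path, text_path)
```

The point is that the image alone is not searchable: the review system depends on this separate mapping to find the text that belongs to each page.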

Advantages of TIFF productions:

  • Ease of Bates Numbering: Bates stamping is used to identify which documents have been produced, particular documents and pages in connection with witness examinations, and which documents have been withheld for privilege. TIFFs can be single or multi-paged. Historically, litigation support vendors have often scanned paper documents, or converted electronic documents, into single-paged or multi-paged TIFFs, with each file name being the Bates Number or Bates Number Range. Each individual page in a production would have its own Bates Number.
  • Improved Redaction: Documents sometimes need to be partially redacted to remove references to privileged information, work product or trade secret information. As a raster image, TIFF files are relatively easy to redact, as compared with native files or PDF files. However, the release of Acrobat Professional 8 with a built-in PDF redaction tool has lessened this advantage of TIFF files.
  • Requirements of Legacy eDiscovery Platforms: Several legacy litigation support management systems work best or exclusively with TIFF files because these systems were designed when TIFF files were the only viable option. These systems predate the development and popularity of PDF and native file review tools.
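The page-level Bates numbering convention described in the list above can be sketched in a few lines. The prefix `ABC`, the seven-digit padding, and the function names are assumptions chosen for illustration; actual productions follow whatever convention the parties agree on.

```python
def bates_number(prefix, number, width=7):
    """Format one Bates number, e.g. ABC0000001."""
    return f"{prefix}{number:0{width}d}"


def assign_bates_ranges(page_counts, prefix="ABC", start=1, width=7):
    """Give every page its own number; return (first, last) per document.

    Each entry in page_counts is the number of pages in one document
    and is assumed to be at least 1.
    """
    ranges = []
    next_number = start
    for pages in page_counts:
        first = bates_number(prefix, next_number, width)
        last = bates_number(prefix, next_number + pages - 1, width)
        ranges.append((first, last))
        next_number += pages
    return ranges


# Three documents of 3, 1 and 2 pages.
for first, last in assign_bates_ranges([3, 1, 2]):
    print(first if first == last else f"{first}-{last}")
```

A multi-page TIFF could then be named after its range, while single-page TIFFs would each take one number from within it.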

Disadvantages of TIFF productions:

  • Complex Load Files: Because TIFF files are raster images, they do not retain computer readable text as part of the file. The text must instead be supplied separately and tied to the images through a load file.
  • Not Very Usable Outside of Legacy Systems: Because of the complexities of the TIFF load file, these files are not very accessible or usable outside of the legacy litigation management systems for which they were designed.
  • Metadata Not Retained in TIFFs: Metadata is not retained as part of a TIFF conversion. To address this shortcoming, many e-Discovery providers now separately save file metadata in a database prior to a TIFF conversion.
  • Cost of TIFF Conversion and Load File Creation: Because of the shortcomings above, a TIFF production requires that the producing party pay to convert electronic files to TIFF images and create the associated text load file so that TIFF-based litigation management systems can read it. This can be very expensive in large productions.

PDF

A more modern approach is to convert electronic files to searchable PDF files for a discovery production. PDF files overcome many of the limitations of working with native files. Adobe, which created PDF, also controls the TIFF specification (acquired along with Aldus, the format’s original developer), and PDF is the far more functional of the two formats. PDFs have become ubiquitous in business and in law.

Advantages of PDF Format:

  • Viewable in Adobe Acrobat: Files are searchable and easy to work with. Anyone with Adobe Acrobat can view a file without the need to worry about having the right application program or viewer installed.
  • Bates Stamping: Documents can be Bates-stamped and pages specifically identified using a variety of software tools.
  • Redaction: Pages or specific passages can be redacted with Adobe’s Acrobat Professional program, version 8 or later.
  • Some Metadata Retained: A PDF conversion can be set up to retain some of the metadata, which can then be viewed by reviewing the properties of the PDF file. Retention of metadata in a PDF file is not automatic, and depends on the conversion software and the settings used in the conversion process.

Disadvantages of PDF Format:

  • Conversion Cost: As with TIFF files, conversion of electronic files to PDF requires expenditures, as compared with simply delivering native file format.
  • Not all Metadata Available: A standard PDF conversion only captures some of the available metadata. Information such as the document author and title typically is captured. The document creation date may be changed to the date the PDF is created. Other key metadata, such as last save, last print, edit time, deletions, comments and hidden text, usually are not captured in the PDF copy.

Native Format

Some litigation professionals pursue discovery in native file format, the original file format in which the electronic file was created, such as Word, Excel or Outlook. This has become more popular since the new federal e-Discovery Amendments, as they give the requesting party greater leeway in requesting files in native format.

Advantages of Native Format:

  • No Conversion Expense: Unlike TIFF or PDF productions, there is no conversion expense in delivering files in native format.
  • All Metadata Retained: All file metadata can be retained in a native production.
  • Text Searchable: Text is usually most searchable in native format. There is no chance of text being lost or corrupted in a conversion to PDF or to a TIFF load file, and no OCR errors are introduced.
  • Some Documents Don’t Display Well in other Formats: Native may be the only practicable format for some file formats, such as spreadsheets. Excel and other spreadsheet files are notorious for converting poorly to TIFF or PDF, often becoming unintelligible. Plus, spreadsheet formulas, hidden cells, and hidden text usually do not make the conversion to other formats.

Disadvantages of Native Format:

  • Difficulty of Pre-Release Review of Metadata: Metadata, by design, are not easy to review in native file format. Some metadata in Office files can be found by clicking through various property screens, but this is time-consuming, requires a consistent methodology to view all viewable metadata, and in the end does not reach all of the metadata available in the file. Newer litigation management systems will display metadata of native files.
  • Difficulty in Bates Stamping at the Page Level: Documents in native file format cannot be easily Bates-stamped, and any Bates stamping will change the metadata. Often Bates stamping of native files is handled instead through a file naming convention, in which the file name is modified to include a Bates designation. This can work well, but does not allow for page-level identification.
  • Inability to Easily Redact: Documents produced in native file format cannot be easily redacted. For this reason, in a native production, documents that need to be redacted are often handled in a different manner, such as converting redacted documents to another format that can be redacted, such as PDF.
  • Difficulty of Pre-Release Review: Attorneys for the party producing electronic files must review the files to see if they are responsive to the discovery request or include privileged information or trade secrets. This can be difficult as electronic files may have been created in multiple applications. Modern litigation support applications allow most native file formats to be reviewed without installing the applications that created the file. Plus, modern litigation support applications allow metadata of native files to be reviewed in an easy fashion.
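The file naming convention mentioned in the list above, in which a native file’s name is modified to include a Bates designation, might look like the following sketch. It only builds a mapping from original names to proposed new names and does not touch any files. The prefix, padding, underscore separator, and function name are all assumptions made for illustration.

```python
def plan_native_names(file_names, prefix="ABC", start=1, width=7):
    """Map each original file name to a name carrying a document-level Bates number."""
    plan = {}
    for offset, name in enumerate(file_names):
        bates = f"{prefix}{start + offset:0{width}d}"
        plan[name] = f"{bates}_{name}"
    return plan


# Hypothetical native files in a production set.
natives = ["Q3 forecast.xlsx", "board memo.docx"]
for original, renamed in plan_native_names(natives, start=12).items():
    print(f"{original} -> {renamed}")
```

Each file receives exactly one number, which is why this approach identifies documents but cannot identify individual pages.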

Advances in technology are reshaping how litigation discovery is handled. Use and availability of electronic documents is changing how discovery is done, with an increasing emphasis on search. Additionally, metadata availability in electronic files requires that litigators find effective tools to review and analyze this new source of information. New discovery rules reflect the reality of available technology and prior paper-based approaches are ineffective and have become outmoded.

The best eDiscovery production format will usually turn on the methodologies and workflows that attorneys and litigation teams plan to use to review the files. Document management systems usually work best with files in certain formats. In addition, consideration should be given to how Bates numbering and redaction will be handled before choosing a format.

OnDemand eDiscovery Processing

Whether you are a litigation services provider or internally managing litigation support functions within a law firm or corporate legal department, ESI processing demands under deadline can often exceed capacity. In order to increase internal capacity, past options have included the purchase of additional processing software licenses and local hardware, as well as increased staffing, to marginally increase throughput and meet demand.

But what about when processing demand temporarily recedes? eProcessing+ enables service providers and litigation support departments to scale up for periods of high demand and scale down when things cool off. And with no associated hardware or licensing costs, plus reduced staffing requirements, eProcessing+ lets you align case costs with revenues.

Key Points

  • eDiscovery processing and the EDRM
  • Fast, scalable processing: why it’s needed
  • Traditional processing workflows
  • Balancing processing demands with internal resources
  • Cost-efficiently increasing capacity and speed
  • Features and benefits of eProcessing+
  • Integrated ECA, processing, and review to speed throughput and reduce total costs
  • Summary

About the Speaker

Stu Van Dusen is an eDiscovery solutions consultant with Lexbe, and a frequent speaker and writer on litigation technology. He has his MS Technology Commercialization from UT Austin, and his BS in Business Administration and Management from Trinity University.
