

E-Discovery: Don't Be Afraid to Be Smart—Select While You Collect 

by: Joel Henry, Law Technology News  

Date: 3/5/2022


Preservation requires parties to ensure that electronically stored information “is protected against inappropriate alteration or destruction,” states the Electronic Discovery Reference Model. Collection, on the other hand, entails “gathering ESI for further use in the e-discovery process,” EDRM continues. Many teams simply equate collection with preservation: if ESI is preserved, it is treated as collected and fed into the review process. This mistake results in inflated ESI review costs.


The RAND Institute for Civil Justice's 2012 report, “Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery,” found that just 8 percent of the total cost of producing electronic items stems from collection. The report also acknowledges that respondents admitted to not tracking these costs in any systematic way. Perhaps this results from IT departments failing to track hours, or from the difficulty of assessing hourly costs and processing overhead. No matter the cause for this lack of public data, rapidly increasing amounts of electronic information (stored in multiple locations, for multiple custodians, relating to multiple issues) are expensive to collect and even more expensive to review. Without a smart collection process, culling of the data set relies on the old and error-prone keyword selection process.


The belief that collecting everything is the only (or even the best) answer also comes from lawyers. Afraid of missing something, anything at all, legal teams default to all-encompassing review of everything collected. Many vendors are perfectly happy to support and even encourage this approach. Charging by the gigabyte, the hour, the custodian or some combination of all three doesn’t reduce the size of a vendor bill. Once everything under the sun, moon and stars has been collected, the next step is culling: reducing the massive collection of electronic documents to something more manageable, but not always more focused. Again, vendors rarely complain about supporting this process, especially coding keyword searches and running them multiple times.


Don’t Cull, Select Smartly


An effective preservation process is both legally required and strategically necessary, and it protects the legal team from sanctions. Preserving everything ensures that nothing will be lost after identification occurs. However, equating preservation with collection produces two important blind spots.


The first blind spot occurs when additional electronic information is found or developed after initial preservation. The simple, yet costly, approach dumps all new ESI into the review pile. Re-executing the keyword culling routine exposes the team to inconsistent results. If the newly culled data set differs from the previously culled set, confidence in the culling process falls while fears of attack from opposing counsel rise.


The second blind spot results from the inability to precisely track differences between repeated culling processes as new issues arise and additional information from the data set is needed. And it's almost inevitable that as the legal team pores over the data and learns more about the case, those new issues will arise. In addition, existing issues may become intertwined with information previously considered non-relevant.

No matter how sophisticated culling can be, it’s the reverse of what should be done. Rather than culling, which is the elimination of ESI, the process should be smart collection of the ESI that is most likely relevant. Smart collection allows repeatable, traceable, measurable collection to occur as many times as needed, with as many issues as needed, on a changing data set. In fact, smart collection is a coarse application of technology-assisted review (TAR). It is coarse in that it need not be finely tuned to find the exact data needed, but rather need only identify a reduced set of ESI likely to be needed.


Smart Collection in Practice


Smart collection starts with a specification of the issues and the desired information, written in plain English rather than as keywords. If legal team leaders can talk about the issues, communicate with clients and opposing counsel and coordinate the legal staff, the specification can be written. The concepts underlying the specification should drive smart collection from preserved ESI.


Using written issues, smart collection tools employ computational linguistics and natural language processing to collect those electronic items containing concepts that match the issues and desired information. The smart collection technology need not be perfect: the threshold for matching the concepts underlying legal issues to concepts within preserved ESI need only be good enough to collect likely data from the preserved data set.


For example, if the legal team specifies “Information about Jim Miller and the November 2013 Costco Contract,” the smart collection tool should identify email containing “Miller, 2013, Costco, agreement” as well as “Jim’s, big, Costco, deal,” and “JM, signed, Costco, sale.” Natural language processing can find equivalent concepts even when words fail to match exactly. Of course, the smart collection tool might also identify “Miller, October, Costco, Contract,” which isn’t pertinent, but again this is a coarse collection process with the goal of selecting likely information, not precise information as is done in review.
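The coarse matching in the Costco example can be illustrated with a minimal Python sketch. It is not how any particular smart collection tool works: the hand-written alias table below is a hypothetical stand-in for what natural language processing would derive, and the names `CONCEPT_ALIASES`, `matched_concepts`, `is_collected` and the threshold of three concepts are all assumptions made for illustration.

```python
import re

# Hypothetical alias table: each concept in the issue specification maps to
# surface forms that count as a mention. A real tool would derive these with
# NLP; this hand-written table is only a stand-in.
CONCEPT_ALIASES = {
    "jim_miller": {"miller", "jim", "jim's", "jm"},
    "november_2013": {"november", "2013"},
    "costco": {"costco"},
    "contract": {"contract", "agreement", "deal", "sale"},
}


def tokenize(text):
    """Lowercase the text and split it into a set of word-like tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def matched_concepts(text):
    """Return the set of concepts with at least one alias in the text."""
    tokens = tokenize(text)
    return {name for name, aliases in CONCEPT_ALIASES.items() if tokens & aliases}


def is_collected(text, threshold=3):
    """Coarse rule: collect the item if enough distinct concepts match."""
    return len(matched_concepts(text)) >= threshold


samples = [
    "Miller, 2013, Costco, agreement",
    "Jim's big Costco deal",
    "JM signed Costco sale",
    "Miller, October, Costco, Contract",  # not pertinent, but still collected
    "Lunch menu for Friday",
]
for sample in samples:
    print(sample, "->", is_collected(sample), sorted(matched_concepts(sample)))
```

The October item is selected because three of the four concepts match. That is the trade-off the coarse threshold makes: some non-pertinent items come through, and review sorts them out later.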


Smart collection uses TAR in a cyclic process that can evaluate itself. Each cycle leverages the input and results of previous cycles to evaluate effectiveness and produce only the newly identified ESI. Such a process can quickly tell the legal team which new data has been selected and exactly which concepts caused a data item to be selected.   
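One way to picture the cyclic process is a log that remembers what earlier cycles selected, returns only the new items, and records which concepts triggered each selection. The Python sketch below is a simplified illustration under stated assumptions: the `CollectionLog` class, the `match_concepts` helper, the document IDs and the whitespace-based term matching are all hypothetical, standing in for whatever a real tool would use.

```python
def match_concepts(text, spec):
    """Return the sorted concept names whose terms appear in the text.

    `spec` maps a concept name to a set of lowercase terms (an assumed,
    simplified form of an issue specification).
    """
    tokens = set(text.lower().split())
    return sorted(name for name, terms in spec.items() if tokens & terms)


class CollectionLog:
    """Tracks which items each cycle selected and why."""

    def __init__(self):
        self.cycle = 0
        self.collected = {}  # doc_id -> (cycle number, matching concepts)

    def run_cycle(self, documents, spec):
        """Run one collection pass and return only newly selected items."""
        self.cycle += 1
        new_items = {}
        for doc_id, text in documents.items():
            if doc_id in self.collected:
                continue  # selected in an earlier cycle; do not report again
            concepts = match_concepts(text, spec)
            if concepts:
                new_items[doc_id] = concepts
                self.collected[doc_id] = (self.cycle, concepts)
        return new_items


log = CollectionLog()
documents = {
    "e1": "miller signed the costco contract",
    "e2": "warranty claim on damaged pallets",
    "e3": "lunch on friday",
}
spec = {"costco_contract": {"costco", "contract"}}
print(log.run_cycle(documents, spec))

# A new issue and a new data source arrive; only the additions are reported.
documents["e4"] = "costco warranty dispute"
spec["warranty"] = {"warranty"}
print(log.run_cycle(documents, spec))
```

Because the log keeps the cycle number and matching concepts for every item, the team can answer both questions the process is meant to answer: what is new, and why it was selected.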


Smart collection takes place like this:


Initial collection from preserved ESI results in both a refined and expanded issue specification as the team reviews collected ESI (this happens naturally as a case or controversy develops).


Refined and new issue specifications drive new and changed collection, which is automated, measured and fully tracked.


Using smart collection results and initial review markings, accuracy, consistency, recall and precision can be calculated and used to assess effectiveness, cost and progress toward completion.


Issue specification and document review can be revised and improved, and be agile enough to handle new issues and new data sources.
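Two of the measures named in the steps above, precision and recall, can be computed directly from the set of collected items and the reviewers' relevance markings. The Python sketch below uses the standard definitions. The function name and the sample IDs are hypothetical, and it assumes reviewers have also marked a sample of items that were not collected, since recall cannot be estimated without knowing what was missed.

```python
def precision_recall(collected_ids, relevant_ids):
    """Compute precision and recall for one collection cycle.

    collected_ids: IDs the collection step selected.
    relevant_ids:  IDs reviewers marked relevant, including any relevant
                   items found in a sample of the uncollected ESI.
    """
    collected = set(collected_ids)
    relevant = set(relevant_ids)
    true_positives = len(collected & relevant)
    # Precision: share of collected items that were actually relevant.
    precision = true_positives / len(collected) if collected else 0.0
    # Recall: share of relevant items that the collection step captured.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical cycle: four items collected, three marked relevant overall,
# two of which were among the collected items.
precision, recall = precision_recall(
    collected_ids={"e1", "e2", "e3", "e4"},
    relevant_ids={"e1", "e2", "e5"},
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tracking these two numbers from cycle to cycle gives the team a rough view of whether a refined issue specification is collecting less noise, missing fewer relevant items, or both.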


Smart collection produces far more defensible results because analytics are built into the process. Because it is automated, smart collection is extremely cost-effective: it can run overnight while preserved and reviewed ESI is at rest. Smart collection is also far less susceptible to human error than predictive coding, which relies heavily on consistent seed set review.


It is time to move away from the first generation of electronic data discovery technology, with its monolithic, step-by-step approach of carrying all the data through all the processes. Instead, use tools that fit the natural language problem inherent in e-discovery. Smart collection leverages the evolution occurring in technology-assisted review and implements a cost-effective data selection process that produces better results faster. Don’t be afraid to be smart. Use a process that works.
