A joint project of Law Technology News and Law.com Legal Technology


March 31, 2008

Making a Hash of It

One day, that sexy ingenue we call "e-discovery" will settle down to a matronly middle age of just being “discovery,” and the phalanx of bloggers dissecting every trial judge’s best guess about electronic evidence will settle down, too.

Until then, I’m grateful to the blogging multitudes who open their diaries to the web and tap-tap-tap away, ensuring that we miss absolutely nothing.


One of my favorite blogs is Ralph Losey’s “e-Discovery Team.”  Ralph writes artfully, passionately and with good humor about EDD topics.  A thorough exploration of his site’s many byways is a virtual e-discovery certification course.  The posts are personal enough in tone that you know where Ralph stands, but never off-putting. Even on those infrequent occasions when I question his conclusions, Ralph’s sincerity and reflection are never in doubt.


It’s one of those rare disagreements that prompts this post. 

On March 31, 2008, Ralph blogged the case of Petcou v. C.H. Robinson Worldwide, Inc., 2008 WL 542684 (N.D. Ga., Feb. 25, 2008).  His post covers the holding splendidly.  In a nutshell, four women brought suit against their employer for harassment encompassing pornographic content allegedly appearing on co-workers’ computer screens.  The court opined that the plaintiffs’ request for production would require a search of all e-mails of all employees in the Atlanta North and Atlanta South branches for a six-year period and questioned, “how Defendant would determine whether e-mails were ‘relating to or evidencing the presence of pornography’ or were ‘of a sexual or gender derogatory nature’ without examining the content of each and every e-mail and without making judgments about what constitutes pornography.”


As so many EDD cases do, the dispute centered on restoration of email on backup tapes.  Finding that the information sought from the tapes was not reasonably accessible because of undue burden or cost, the court failed to find good cause and declined to compel all the production sought.


Ralph Losey comments:

“Magistrate Judge Brill then goes on to conclude that good cause has not been established, primarily because the over broad nature of the request would require defendants to examine every email in the system. The volume of email existing over the two-year span of the backup tapes is not stated, but, for a company such as this with 5,300 employees, there would typically be millions of emails. The cost to review this much email would be astronomical.”

Ralph adds:

“Of course, plaintiffs could have tried to avoid all of this with a clear and very focused request to produce. For instance, they could have requested production of specific pornographic images with known hash values. See Krause v. Texas, 243 S.W. 3d 95 (1st DCA, Oct. 31, 2007). This is the way police searches are normally conducted to locate pornography. This kind of search could have been fully automated, and might not have imposed an undue burden on defendant.”

This is where I part ways with Ralph.  He’s right on target emphasizing the need for clear, narrowly tailored requests, but the wheels come off when he posits hash-matched searches as a simple, low-cost resolution.  That approach suffers from three hiccups: it’s not simple, it’s costly and, here, it won’t work.


By way of background, Ralph Losey and I share a rhapsodic admiration for the use of hash values to simplify authentication and identification in electronic discovery.  Hashing is a way to represent any digital data (a file, a digital photo or the entire contents of a hard drive) as an alphanumeric value of fixed length (e.g., 32 hexadecimal characters using the popular MD5 hash algorithm).  This hash value is tantamount to a unique fingerprint for the data, such that the chance of two differing files sharing the same hash value is an infinitesimal 1-in-340 trillion trillion trillion!  It’s a wondrous technology.
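
For readers who like to see it in action, here is a minimal sketch of the fixed-length property using Python's standard hashlib module. The helper name md5_hex and the sample inputs are my own illustrations, not anything drawn from the case or from Ralph's post.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Illustrative inputs only: one byte versus one million bytes.
tiny = md5_hex(b"a")
huge = md5_hex(b"a" * 1_000_000)

# Both digests have the same length, whatever the input size.
print(len(tiny), len(huge))

# MD5 produces a 128-bit value, so there are 2**128 possible digests.
# That is roughly 3.4 x 10**38, the "340 trillion trillion trillion".
print(2 ** 128)
```

The digest length stays at 32 hexadecimal characters whether the input is one byte or an entire drive image.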


Yes, law enforcement does use hash values to locate known child pornography, a practice with several advantages, among them that it eliminates the need to make judgments about the age of the victim or whether the image is real or synthesized.  Additionally, it’s a way to ease the emotional toll on the examiner (stemming from repeated viewing of pretty horrific stuff) and eliminate the need to re-publish and distribute contraband images.  To facilitate this process, the government makes extensive hash lists of known kiddy porn images freely available to law enforcement.


In my work as a forensic examiner, I also use hash values as a means to track down company secrets that wander off when employees quit for greener pastures.  By hashing secret data, you can reveal its digital fingerprint to the other side and comparisons can be made using hash values without further compromising the data.
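
A rough sketch of that kind of comparison follows, again in Python. The file names, contents and the find_matches helper are hypothetical stand-ins. Note what the sketch takes for granted: the other side's data is already in hand, byte for byte, before any hash can be computed.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def find_matches(known_hashes: set[str], candidates: dict[str, bytes]) -> list[str]:
    """Return the names of candidate items whose MD5 appears in known_hashes."""
    return sorted(
        name for name, data in candidates.items()
        if md5_hex(data) in known_hashes
    )


# Owner's side: hash the secret and disclose only the hash value.
secret = b"Q3 pricing model, confidential"
disclosed = {md5_hex(secret)}

# Other side: hash whatever is in hand and compare against the list.
# (Hypothetical files; the last one differs by a single trailing period.)
files_in_hand = {
    "notes.txt": b"lunch order",
    "copy.dat": b"Q3 pricing model, confidential",
    "edited.dat": b"Q3 pricing model, confidential.",
}

print(find_matches(disclosed, files_in_hand))  # ['copy.dat']
```

Only the exact copy matches; the version with one added character does not.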


But the hole in Ralph’s reasoning is that you can’t calculate matching hash values against data until you have the data being evaluated in hand and in a form utterly identical to the data used to generate the hash values. Thus, the suggestion that hashing might eliminate undue burden or fully automate the process ignores the fact that the data itself must be extracted from the backup tapes before it can be hashed.  You still need to restore all those tapes. 


Further, the images embedded in the e-mail must themselves be extracted from the messages before they can be checked for matching hash values.  Before images can travel as attachments to e-mail, they are encoded as plain text characters, typically using Base64 encoding under the MIME standard.  Once encoded, the attachment as it sits inside the message no longer shares a hash value with the original photo, though both still represent the identical image.  So, all of the e-mail must be processed, and each attachment decoded, before hash comparisons are feasible.
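
The encoding problem is easy to demonstrate with Python's standard email package. This is only a sketch under stated assumptions: the "image" is placeholder bytes, the message is built in memory, and nothing here deals with restoring tapes or with a real mail store's container format.

```python
import base64
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Placeholder bytes standing in for a real image file (an assumption).
image_bytes = bytes(range(256)) * 4

# Build a message carrying the image as a Base64-encoded attachment.
msg = EmailMessage()
msg["Subject"] = "example"
msg.set_content("See attached.")
msg.add_attachment(image_bytes, maintype="image", subtype="jpeg",
                   filename="photo.jpg")
raw_message = msg.as_bytes()

# Hashing the encoded form, or the whole message, does not match the image.
print(md5_hex(image_bytes) == md5_hex(base64.b64encode(image_bytes)))  # False
print(md5_hex(image_bytes) == md5_hex(raw_message))                    # False

# Only after parsing the message and decoding the attachment do they match.
parsed = message_from_bytes(raw_message, policy=policy.default)
extracted = [part.get_content() for part in parsed.iter_attachments()]
print(md5_hex(image_bytes) == md5_hex(extracted[0]))                   # True
```

Multiply that parse-and-decode step by every message on every restored tape and the "fully automated" search looks a good deal less free.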


Another practical problem with Ralph’s approach is that it assumes that the plaintiffs can supply hash values of the digital image files of which they complain.  Insofar as I’m aware, there are no government hash sets that purport to contain all or even most of the non-contraband sexually explicit imagery extant, and it would be impossible to fashion your own unless you already have electronic versions of the data. Even the same photograph scanned twice on the same scanner using the same software will virtually never yield the same hash value, because even a single variation of a single pixel will radically alter the hash value.  Such minuscule variation is inevitable.
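
To see how unforgiving hashing is, consider this small Python sketch. The two "scans" are made-up byte buffers, not real image files, and they differ by a single bit in a single byte.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Two hypothetical "scans": 10,000 identical bytes of mid-gray pixel data.
scan_a = bytearray(b"\x80" * 10_000)
scan_b = bytearray(scan_a)
scan_b[5_000] ^= 0x01  # flip the lowest bit of one "pixel"

hash_a = md5_hex(bytes(scan_a))
hash_b = md5_hex(bytes(scan_b))

print(hash_a)
print(hash_b)
print(differing_hex_digits(hash_a, hash_b), "of 32 hex digits differ")
```

The two buffers are 99.99 percent identical to the eye, yet their hash values have nothing useful in common, so a hash taken from one copy tells you nothing about the other.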


Even after all this trouble—and by now you’ve incurred the lion’s share of the costs to get to the inaccessible content—the hash values don’t save you all that much trouble because, once the images are extracted and reconstituted as, e.g., JPG files, it’s not an especially onerous burden to filter for just image file attachments and do a manual review.  Using proper tools, it’s even feasible to further filter the images for skin tone values and aspect ratios most often associated with pornographic images.


The reason I make such a big deal of this is that Ralph Losey is an influential voice in these matters.  If Ralph improvidently suggests there’s little burden to recovering images on voluminous e-mail backup tapes once you simply specify matching hash values, there’s a chance that someone might accept that as true.  It’s not.  Hash matching has its place, and when it’s the right technology it will save a lot of time and money, but, sorry Ralph, not on these facts.


Again, don't judge Ralph's terrific site by this one (IMHO) gaffe; he gets it right far more than most, and doubtless more than yours truly.  CDB


Comments

Thank you for posting this. As a security professional who has to respond to discovery requests, it's refreshing to find someone with the comprehension of what a monumental task it is to reproduce what seems, to the layman, a simple task.

I find that most people, when thinking of reproducing digital content, think of their own personal experiences. Restoring a single image attached to a single email is a relatively trivial task using tools such as munpack. Automating that process to sift through millions of emails, extract the images and uniquely name each one AND tie it to the original email is something else altogether.

Thanks Craig for catching that mistake. I have already corrected it. I have also added a new paragraph to the blog publicly thanking you for the insights and extolling the virtues of instant peer reviews made possible by the Internet. Please continue and let me know anytime I err again in the future. And thanks for being so darn nice about it!
