2019-09-27

The pace of PACER

Permit me a brief, hilarious diversion into the world of US Government corporate IT. PACER is a USA federal online system - "Public Access to Court Electronic Records" which lets people and companies access transcribed records from the US courts. One of their judges has been testifying to the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet and in the process revealed interesting - and horrifying - numbers.

TL;DR -

  1. it costs at least 4x what it reasonably should; but
  2. any cost savings will be eaten up by increased lawyer usage; nevertheless,
  3. rampant capitalism might be at least a partial improvement; so
  4. the government could upload the PACER docs to the cloud, employ a team of 5-10 to manage the service in the cloud, and save beaucoup $$.
Of course, I could be wrong on point 2, but I bet I'm not.

Background

PACER operates with all the ruthless efficiency we have come to expect from the federal government.[1] It's not free; anyone can register for it, usage requires a payment instrument (credit card) but it is free if you use less than $15 per quarter. The basis of charging is:

All registered agencies or individuals are charged a user fee of $0.10 per page. This charge applies to the number of pages that results from any search, including a search that yields no matches (one page for no matches). You will be billed quarterly.
You would think that, at worst, it would be cost-neutral. One page of black+white text at reasonably high resolution is a bit less than 1MB, and (for an ISP) that costs less than 1c to serve on the network. Therefore you spend less than 9c on the machines and people required to store and serve the data, and profit!

Apparently not...

The PACER claims

It was at this point in the article that I fell off my chair:

Fleissig said preliminary figures show that court filing fees would go up by about $750 per case to “produce revenue equal to the judiciary’s average annual collections under the current public access framework.” That could, for example, drive up the current district court civil filing fee from $350 to $1,100, she said.
What the actual expletive? This implies that:
  1. the average filing requests 7500 pages of PACER documents - and that the lawyers aren't caching pages to reduce client costs (hollow laughter); or
  2. the average filing requests 25 PACER searches; or
  3. the average client is somewhere on the continuum between these points.
It seems ridiculously expensive. One can only conclude, reluctantly, that lawyers are not trying to drive down costs for their clients; I know, it's very hard to credit. [2]

And this assumes that 10c/page and $30/search is the actual cost to PACER - let us dig into this.

The operational costs

Apparently PACER costs the government $100M/year to operate:

“Our case management and public access systems can never be free because they require over $100 million per year just to operate,” [Judge Audrey] Fleissig said [in testimony for the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet]. “That money must come from somewhere.”
Judge Fleissig is correct in the broad sense - but hang on, $100M in costs to run this thing? How much traffic does it get?

The serving costs

Let's look at the serving requirements:

PACER, which processed more than 500 million requests for case information last fiscal year
Gosh, that's a lot. What's that per second? 3600 seconds/hour x 24 hours/day x 365 days/year is 32 million seconds/year, so Judge Fleissig is talking about... 16 queries per second. Assume that's one query per page. That's laughably small.

Assume that peak traffic is 10x that, and you can serve comfortably 4 x 1MB pages per second on a 100Mbit network connection from a single machine; that's 40 machines with associated hardware, say amortized cost of $2,000/year per machine - implies order of $100K/year on hardware, to ensure a great user experience 24 hours per day 365 days per year. Compared to $100M/year budget, that's noise. And you can save 50% just by halving the number of machines and rejecting excess traffic at peak times.

The ingestion and storage costs

Perhaps the case ingestion is intrinsically expensive, with PACER having to handle non-standard formats? Nope:

The Judiciary is planning to change the technical standard for filing documents in the Case Management and Electronic Case Filing (CM/ECF) system from PDF to PDF/A. This change will improve the archiving and preservation of case-related documents.
So PACER ingests PDFs from courts - plus, I assume, some metadata - and serves PDFs to users.

How much data does PACER ingest and hold? This is a great Fermi question; here's a good worked example of answer, with some data.

There's a useful Ars Technica article on Aaron Swartz that gives us data on the document corpus as of 2013:

PACER has more than 500 million documents
Assume it's doubled as of 2019, that's 1 billion documents. Assume 1MB/page, 10 pages/doc, that's 10^9 docs x 10 MB per doc = 10^10 MB = 1x10^4 TB. That's 1000 x 10TB hard drives. Assume $300/drive, and drives last 3 years, and you need twice the number of drives to give redundancy, that's $200 per 10TB per year in storage costs, or $200K for 10,000 TB. Still, noise compared to $100M/year budget. But the operational costs of managing that storage can be high - which is why Cloud services like Amazon Web Services, Azure and Google Cloud have done a lot of work to offer managed services in this area.

Amazon, for instance, charges $0.023 per GB per month for storage (on one price model) - for 10^9 x 1MB docs, that's 1,000,000 GB x $0.023 or $23K/month, $276K/year. Still way less than 1% of the $100M/year budget.

Incidentally Aaron Swartz agrees with the general thrust of my article:

Yet PACER fee collections appear to have dramatically outstripped the cost of running the PACER system. PACER users paid about $120 million in 2012, thanks in part to a 25 percent fee hike announced in 2011. But Schultze says the judiciary's own figures show running PACER only costs around $20 million.
A rise in costs of 5x in 6 years? That's approximately doubling every 2 years. As noted above, it seems unlikely to be due to serving costs - even though volumes have risen, serving and storage costs have got cheaper. Bet it's down to personnel costs. I'd love to see the accounts break-down. How many people are they employing, and what are those people doing?

The indexing costs - or lack thereof

Indexing words and then searching a large corpus of text is notoriously expensive - that's what my 10c per electronic page is paying for, right? Apparently not:

There is a fee for retrieving and distributing case information for you: $30 for the search, plus $0.10 per page per document delivered electronically, up to 5 documents (30 page cap applies).
It appears that PACER is primarily constructed to deliver responses to "show me the records of case XXXYYY" or "show me all cases from court ZZZ", not "show me all cases that mention 'Britney Spears'." That's a perfectly valid decision but makes it rather hard to justify the operating costs.

Security considerations

Oh, please. These docs are open to anyone who has an account. The only thing PACER should be worried about is someone in Bangalore or Shanghai scraping the corpus, or the top N% of cases, and serving that content for much less cost. Indeed, that's why they got upset at Aaron Swartz. Honestly, though, the bulk of their users - law firms - are very price-insensitive. Indeed, they quite possibly charge their clients 125% or more of their PACER costs, so if PACER doubled costs overnight they'd celebrate.

I hope I'm wrong. I'm afraid I'm not.

Public serving alternatives

I don't know how much Bing costs to operate, but I'd bet a) that its document corpus is bigger than PACER, b) that its operating costs are comparable, c) that its indexing is better than PACER, d) that its search is better than PACER, e) that its page serving latency is better than PACER... you get the picture.

Really though, if I were looking for a system to replace this, I'd build off an off-the-shelf solution to translate inbound PDFs to indexed text - something like OpenText - and run a small serving stack on top. That reduces the regular serving cost, since pages are a few KB of text rather than 1MB of PDF, and lets me get rid of all the current people costs associated with the customized search and indexing work on the current corpus.

PACER is a terrible use of government money

Undoubtedly it's not the worst[3], but I'd love for the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet to drag Jeff Bezos in to testify and ask him to quote a ballpark number for serving PACER off Amazon Web Services, with guaranteed 100% profit margin.

Bet it's less than 1/4 of the current $100M/year.

[1] Yes, irony
[2] Why does New Jersey have the most toxic waste dumps and California the most lawyers? California New Jersey got first choice. [Thanks Mr Worstall!]
[3] Which is terribly depressing.

2 comments:

  1. Troy Hunt was running haveibeenpwned for $9/day (yes NINE WHOLE DOLLARS) on Azure functions, serving up 141m queries per month.

    https://www.troyhunt.com/serverless-to-the-max-doing-big-things-for-small-dollars-with-cloudflare-workers-and-azure-functions/

    ReplyDelete
  2. Stigler - hehe, I wish someone at the hearing had brought that up with Judge Fleissig.

    ReplyDelete

All comments are subject to retrospective moderation. I will only reject spam, gratuitous abuse, and wilful stupidity.