2019-09-27

The pace of PACER

Permit me a brief, hilarious diversion into the world of US Government corporate IT. PACER - "Public Access to Court Electronic Records" - is a US federal online system which lets people and companies access transcribed records from the US courts. One of the federal judges has been testifying to the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet, and in the process revealed interesting - and horrifying - numbers.

TL;DR -

  1. it costs at least 4x what it reasonably should; but
  2. any cost savings will be eaten up by increased lawyer usage; nevertheless,
  3. rampant capitalism might be at least a partial improvement; so
  4. the government could upload the PACER docs to the cloud, employ a team of 5-10 to manage the service in the cloud, and save beaucoup $$.
Of course, I could be wrong on point 2, but I bet I'm not.

Background

PACER operates with all the ruthless efficiency we have come to expect from the federal government.[1] It's not free: anyone can register for it, but usage requires a payment instrument (credit card), and you only escape charges if you run up less than $15 per quarter. The basis of charging is:

All registered agencies or individuals are charged a user fee of $0.10 per page. This charge applies to the number of pages that results from any search, including a search that yields no matches (one page for no matches). You will be billed quarterly.
You would think that, at worst, it would be cost-neutral. One page of black+white text at reasonably high resolution is a bit less than 1MB, and (for an ISP) that costs less than 1c to serve on the network. So as long as you spend less than 9c per page on the machines and people required to store and serve the data, you profit!

Apparently not...

The PACER claims

It was at this point in the article that I fell off my chair:

Fleissig said preliminary figures show that court filing fees would go up by about $750 per case to “produce revenue equal to the judiciary’s average annual collections under the current public access framework.” That could, for example, drive up the current district court civil filing fee from $350 to $1,100, she said.
What the actual expletive? This implies that:
  1. the average filing requests 7500 pages of PACER documents - and that the lawyers aren't caching pages to reduce client costs (hollow laughter); or
  2. the average filing requests 25 PACER searches; or
  3. the average client is somewhere on the continuum between these points.
It seems ridiculously expensive. One can only conclude, reluctantly, that lawyers are not trying to drive down costs for their clients; I know, it's very hard to credit. [2]
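
The arithmetic behind those two end points is just the $750 increase divided by each published fee. A minimal sketch, using only the figures quoted above (the variable and function names are mine):

```python
# Back-of-envelope check: how many billable units does a $750 fee increase buy?
FEE_INCREASE_PER_CASE = 750.00  # dollars, from the quoted testimony
FEE_PER_PAGE = 0.10             # dollars per page, from the PACER fee description
FEE_PER_SEARCH = 30.00          # dollars per search, from the PACER fee description


def implied_units(fee_increase: float, unit_price: float) -> int:
    """Whole number of units (pages or searches) that fee_increase pays for."""
    return round(fee_increase / unit_price)


pages_per_filing = implied_units(FEE_INCREASE_PER_CASE, FEE_PER_PAGE)
searches_per_filing = implied_units(FEE_INCREASE_PER_CASE, FEE_PER_SEARCH)

print(f"pages per filing:    {pages_per_filing}")
print(f"searches per filing: {searches_per_filing}")
```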

And this assumes that 10c/page and $30/search is the actual cost to PACER - let us dig into this.

The operational costs

Apparently PACER costs the government $100M/year to operate:

“Our case management and public access systems can never be free because they require over $100 million per year just to operate,” [Judge Audrey] Fleissig said [in testimony for the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet]. “That money must come from somewhere.”
Judge Fleissig is correct in the broad sense - but hang on, $100M in costs to run this thing? How much traffic does it get?

The serving costs

Let's look at the serving requirements:

PACER, which processed more than 500 million requests for case information last fiscal year
Gosh, that's a lot. What's that per second? 3600 seconds/hour x 24 hours/day x 365 days/year is 32 million seconds/year, so Judge Fleissig is talking about... 16 queries per second. Assume that's one query per page. That's laughably small.
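
For anyone who wants to check the division, here is the same calculation as a short sketch. The only input is the 500 million requests/year figure quoted above; the names are mine.

```python
# Average request rate implied by 500 million requests per year.
SECONDS_PER_YEAR = 3600 * 24 * 365  # 31,536,000, i.e. roughly 32 million
REQUESTS_PER_YEAR = 500_000_000     # from the quoted PACER figure


def average_qps(requests_per_year: int, seconds_per_year: int = SECONDS_PER_YEAR) -> float:
    """Mean queries per second, assuming traffic is spread evenly over the year."""
    return requests_per_year / seconds_per_year


qps = average_qps(REQUESTS_PER_YEAR)
print(f"average load: {qps:.1f} queries/second (call it {round(qps)})")
```

Real traffic will bunch up in US business hours, of course, which is why the next step applies a peak multiplier.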

Assume that peak traffic is 10x that, and you can serve comfortably 4 x 1MB pages per second on a 100Mbit network connection from a single machine; that's 40 machines with associated hardware, say amortized cost of $2,000/year per machine - implies order of $100K/year on hardware, to ensure a great user experience 24 hours per day 365 days per year. Compared to $100M/year budget, that's noise. And you can save 50% just by halving the number of machines and rejecting excess traffic at peak times.
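
The sizing above can be sketched in a few lines. Every number here is one of my assumptions from the paragraph (10x peak, 4 pages/second/machine, $2,000/machine/year), not a measured PACER figure.

```python
import math

AVERAGE_QPS = 16                  # from the request-rate estimate
PEAK_MULTIPLIER = 10              # assumption: peak is 10x average
PAGES_PER_SEC_PER_MACHINE = 4     # assumption: 4 x 1MB pages/s on a 100Mbit link
COST_PER_MACHINE_PER_YEAR = 2000  # assumption: amortized dollars per machine
ANNUAL_BUDGET = 100_000_000       # the quoted $100M/year


def machines_needed(avg_qps: float, peak_multiplier: float, per_machine: float) -> int:
    """Machines required to serve peak load, rounded up to a whole machine."""
    return math.ceil(avg_qps * peak_multiplier / per_machine)


machines = machines_needed(AVERAGE_QPS, PEAK_MULTIPLIER, PAGES_PER_SEC_PER_MACHINE)
annual_hardware_cost = machines * COST_PER_MACHINE_PER_YEAR

# A 100Mbit link carries 12.5 MB/s, so 4 MB/s uses about a third of it.
link_utilisation = PAGES_PER_SEC_PER_MACHINE / (100 / 8)

print(f"{machines} machines, ${annual_hardware_cost:,}/year")
print(f"share of budget: {annual_hardware_cost / ANNUAL_BUDGET:.2%}")
print(f"link utilisation per machine: {link_utilisation:.0%}")
```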

The ingestion and storage costs

Perhaps the case ingestion is intrinsically expensive, with PACER having to handle non-standard formats? Nope:

The Judiciary is planning to change the technical standard for filing documents in the Case Management and Electronic Case Filing (CM/ECF) system from PDF to PDF/A. This change will improve the archiving and preservation of case-related documents.
So PACER ingests PDFs from courts - plus, I assume, some metadata - and serves PDFs to users.

How much data does PACER ingest and hold? This is a great Fermi question; here's a good worked example of an answer, with some data.

There's a useful Ars Technica article on Aaron Swartz that gives us data on the document corpus as of 2013:

PACER has more than 500 million documents
Assume it's doubled as of 2019, that's 1 billion documents. Assume 1MB/page, 10 pages/doc, that's 10^9 docs x 10 MB per doc = 10^10 MB = 1x10^4 TB. That's 1000 x 10TB hard drives. Assume $300/drive, and drives last 3 years, and you need twice the number of drives to give redundancy, that's $200 per 10TB per year in storage costs, or $200K for 10,000 TB. Still, noise compared to $100M/year budget. But the operational costs of managing that storage can be high - which is why Cloud services like Amazon Web Services, Azure and Google Cloud have done a lot of work to offer managed services in this area.
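
Here is that Fermi estimate as a sketch, so you can plug in your own guesses. All the inputs are the assumptions stated above (1 billion docs, 10 pages of 1MB each, $300 drives lasting 3 years, 2x redundancy); none of them are official PACER numbers.

```python
DOCS = 1_000_000_000   # assumption: the 2013 corpus has doubled
MB_PER_DOC = 10        # assumption: 10 pages x 1MB/page
DRIVE_TB = 10          # capacity of one drive
DRIVE_PRICE = 300      # dollars per drive
DRIVE_LIFE_YEARS = 3   # assumed replacement cycle
REDUNDANCY = 2         # keep two copies of everything


def annual_drive_cost(docs: int, mb_per_doc: int) -> tuple[int, int, int]:
    """Return (corpus size in TB, drives to buy, dollars per year)."""
    total_tb = docs * mb_per_doc // 1_000_000      # 1 TB = 10^6 MB
    drives = REDUNDANCY * total_tb // DRIVE_TB
    cost_per_year = drives * DRIVE_PRICE // DRIVE_LIFE_YEARS
    return total_tb, drives, cost_per_year


total_tb, drives, cost_per_year = annual_drive_cost(DOCS, MB_PER_DOC)
print(f"{total_tb:,} TB on {drives:,} drives, about ${cost_per_year:,}/year")
```

This counts raw disk only; power, racks and the people to swap dead drives are exactly the operational overhead mentioned above.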

Amazon, for instance, charges $0.023 per GB per month for storage (on one price model) - for 10^9 x 1MB docs, that's 1,000,000 GB x $0.023 or $23K/month, $276K/year. At the 10MB/doc assumed above it is ten times that, about $2.8M/year. Still less than 3% of the $100M/year budget.
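
A minimal sketch of the managed-storage sum at the quoted $0.023 per GB per month, run for both 1MB and 10MB per document so you can see how sensitive the answer is to the document-size guess. It ignores request and egress charges and any volume discount, which is an assumption on my part.

```python
PRICE_PER_GB_MONTH = 0.023   # dollars, the price quoted above
DOCS = 1_000_000_000         # assumption: 1 billion documents
ANNUAL_BUDGET = 100_000_000  # the quoted $100M/year


def s3_annual_cost(total_gb: float, price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Yearly storage bill in dollars for total_gb held all year."""
    return total_gb * price_per_gb_month * 12


for mb_per_doc in (1, 10):
    total_gb = DOCS * mb_per_doc / 1000  # 1 GB = 1000 MB
    yearly = s3_annual_cost(total_gb)
    print(f"{mb_per_doc:>2} MB/doc: ${yearly:,.0f}/year "
          f"({yearly / ANNUAL_BUDGET:.2%} of budget)")
```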

Incidentally, the Ars Technica article agrees with the general thrust of mine:

Yet PACER fee collections appear to have dramatically outstripped the cost of running the PACER system. PACER users paid about $120 million in 2012, thanks in part to a 25 percent fee hike announced in 2011. But Schultze says the judiciary's own figures show running PACER only costs around $20 million.
A rise in costs of 5x in 6 years? That's doubling roughly every two and a half years. As noted above, it seems unlikely to be due to serving costs - even though volumes have risen, serving and storage costs have got cheaper. Bet it's down to personnel costs. I'd love to see a breakdown of the accounts. How many people are they employing, and what are those people doing?
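
The implied doubling time falls out of a one-line logarithm. A sketch, taking the $20M and $100M figures above at face value and assuming smooth exponential growth between them:

```python
import math

COST_THEN = 20_000_000   # dollars/year, the figure attributed to Schultze
COST_NOW = 100_000_000   # dollars/year, from the quoted testimony
YEARS_BETWEEN = 6        # the gap used in the text


def doubling_time(start: float, end: float, years: float) -> float:
    """Years per doubling if cost grew exponentially from start to end."""
    return years * math.log(2) / math.log(end / start)


def annual_growth(start: float, end: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.3 means 30% per year)."""
    return (end / start) ** (1 / years) - 1


print(f"doubling time: {doubling_time(COST_THEN, COST_NOW, YEARS_BETWEEN):.1f} years")
print(f"annual growth: {annual_growth(COST_THEN, COST_NOW, YEARS_BETWEEN):.0%}")
```

That comes out a little over two and a half years per doubling.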

The indexing costs - or lack thereof

Indexing words and then searching a large corpus of text is notoriously expensive - that's what my 10c per electronic page is paying for, right? Apparently not:

There is a fee for retrieving and distributing case information for you: $30 for the search, plus $0.10 per page per document delivered electronically, up to 5 documents (30 page cap applies).
It appears that PACER is primarily constructed to deliver responses to "show me the records of case XXXYYY" or "show me all cases from court ZZZ", not "show me all cases that mention 'Britney Spears'." That's a perfectly valid decision but makes it rather hard to justify the operating costs.

Security considerations

Oh, please. These docs are open to anyone who has an account. The only thing PACER should be worried about is someone in Bangalore or Shanghai scraping the corpus, or the top N% of cases, and serving that content at much lower cost. Indeed, that's why they got upset at Aaron Swartz. Honestly, though, the bulk of their users - law firms - are very price-insensitive. Indeed, they quite possibly charge their clients 125% or more of their PACER costs, so if PACER doubled costs overnight they'd celebrate.

I hope I'm wrong. I'm afraid I'm not.

Public serving alternatives

I don't know how much Bing costs to operate, but I'd bet a) that its document corpus is bigger than PACER, b) that its operating costs are comparable, c) that its indexing is better than PACER, d) that its search is better than PACER, e) that its page serving latency is better than PACER... you get the picture.

Really though, if I were looking for a system to replace this, I'd build on an off-the-shelf solution to translate inbound PDFs to indexed text - something like OpenText - and run a small serving stack on top. That reduces the regular serving cost, since pages are a few KB of text rather than 1MB of PDF, and lets me get rid of all the current people costs associated with the customized search and indexing work on the current corpus.
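
To give a feel for how little machinery "indexed text" needs at its core, here is a toy inverted index. It is a sketch only: it assumes the PDF-to-text step has already happened, the case IDs and text are made up, and a real product would add ranking, stemming and persistence.

```python
from __future__ import annotations

import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of document IDs that contain it."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> set[str]:
        """Return IDs of documents containing every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


# Hypothetical documents, standing in for text extracted from filed PDFs.
index = InvertedIndex()
index.add("case-001", "Motion to dismiss filed by defendant")
index.add("case-002", "Order granting motion for summary judgment")

print(sorted(index.search("motion")))
print(sorted(index.search("summary motion")))
```

Lookups are set intersections over small in-memory structures, which is why the serving side of full-text search is cheap once the text has been extracted.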

PACER is a terrible use of government money

Undoubtedly it's not the worst[3], but I'd love for the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet to drag Jeff Bezos in to testify and ask him to quote a ballpark number for serving PACER off Amazon Web Services, with guaranteed 100% profit margin.

Bet it's less than 1/4 of the current $100M/year.

[1] Yes, irony
[2] Why does New Jersey have the most toxic waste dumps and California the most lawyers? New Jersey got first choice. [Thanks Mr Worstall!]
[3] Which is terribly depressing.

2019-09-22

Deconstructing Dr Rachel McKinnon

Those of my readers who are keen followers of trans rights issues - likely none - may be aware of the controversy surrounding Dr Rachel McKinnon (person's preferred Twitter handle) who is a man who identifies as a woman ("trans woman"). McKinnon was previously an OK-but-far-from-top-tier cyclist in the men's arena. Upon "becoming" a woman, McKinnon quickly powered to the top ranking, including a win in the UCI Masters Track Cycling World Championship in the 35-44 age group (female), and if you click through to that link you might have an inkling why.

McKinnon has been assiduous in chasing down (and blocking) anyone on Twitter who questions the fairness of a physiological man competing with physiological females. I can't imagine why, unless there's a certain element of feeling guilty about sudden un-earned success.

Luckily, the golden fountain of academic publishing has provided a definitive voice on the subject[1]. McKinnon has published a paper (co-authored with Dr. Aryn Conrad) in PHILOSOPHICAL TOPICS, VOL. 46, NO. 2, FALL 2018 which settles the issue once and for all. [Rachel, FYI, I've squirrelled away a copy of this in case you delete it.]

Aryn Conrad, if you were wondering, also appears not to have been born in the same gender to which they now identify. Apparently Aryn is "the granddaughter of Mexican immigrants" though I wonder whether that's exactly the same relationship that the grandparents would state.

Let's take a walk through this article. The abstract sets out their goal:

We argue that the inclusion of trans athletes in competition commensurate with their legal gender is the most consistent position with these principles of fair and equitable sport.
Gosh, that's not something we could have predicted, at all. But perhaps we're being unfair, what's the actual argument? Well:
We suggest that the justificatory burden for such prima facie discrimination [endogenous testosterone limits] is unlikely to be met. Thus, in place of a limit on endogenous testosterone for women (whether cisgender, transgender, or intersex), we argue that ‘legally recognized gender’ is most fully in line with IOC and CAS principles.
In other words, it doesn't matter if trans athletes have a material physiological advantage over women, the paper wants to talk about whether the existing regulations are fully consistent with respect to the issues of male-to-female athletes. This approach is certain to win over female athletes on the lower steps of the podium, of course.

It's a poor quality "paper", by the way; 61 double-spaced pages without diagrams before you get to the appendices, so about 30 normal pages. Contrary to what aspiring academics might think, length is generally inversely proportional to quality. If you can't make the core argument in 10 pages, you're probably relying on length to cover up plot holes. It also doesn't follow the usual structure of "tell 'em what you're going to say, say it, tell them that you've said it" - perhaps because that would make it much easier to check their claims.

Reading through the paper, the key claims are:

  1. International sport regulations, and their legal effect and scope, are complicated;
  2. There are some edge cases of people born as women with high testosterone, which have not been handled consistently;
  3. Sport regs say we must not discriminate on various grounds - is "gender identity" (as opposed to biological sex) one of those grounds? (you'll be shocked to learn that the authors think that it is);
  4. It is apparently not clear that biological women with excess testosterone have a significant physiological advantage over other women;
  5. What is the meaning of "fairness / level playing field" in sport? There's a huge amount of waffle here, but it seems to boil down to "gender identity is intrinsic, so you can't base fairness on it in the same way that you can't say that a 7 foot tall person has an unfair advantage in the high jump". (I'm doing some serious editing here; the text is sprawling, terribly structured and poorly summarised.)

At this point I'd like to pull out the quote:

So if trans women are female, we ask, why would 'male' physiological data be relevant to the question of fairness? We know this won’t be convincing. But it is an important question to confront.
Well, there's the tiny matter that male physiology is hugely relevant to performance in sport. "We know this won't be convincing" - yes, because it is not at all convincing. It is, if you'll excuse the phrase in this context, bollocks.

Continuing, we have:

  1. Placing upper limits on testosterone in "women" is totes unfair;
  2. Trans women's physiological advantage is not that big, in fact men and women almost completely overlap in physiology (I swear, that's what they say);
  3. Trans women are actually just like regular hi-testosterone women in sports performance;
  4. Indeed, setting testosterone limits on women in sport is probably unfair and unreasonable;
  5. Bodies are complex and testosterone levels are not the whole story by a long chalk;
  6. Testosterone levels don't seem correlated with performance by elite men;
  7. Actually, just don't use testosterone to judge who's a man and who's a woman - just take their word for it;
  8. Let's look at Caster Semenya as an edge case of high-performing woman with testosterone, trans women are totes the same as her;
  9. If you don't let men identifying as women compete in women-only events, it's just not fair dammit.
My goodness me. I'm glad I only had to read that once. If I were designing a paper structure to bury the facts and specific arguments, I don't think I could have done better. Props to McKinnon and Conrad. Of course, if they were actually trying to convince rather than write an obscure scrawl to point to as "academic validation of our argument, baby!" they'd have written it differently.

I don't know who the reviewers were for this paper, but if they got any remuneration then I'd recommend clawing it back, sharp-ish.

Rachel and Aryn: if you want to submit a more compact version of the paper to a journal with standards on conciseness, you're welcome to build from the above structure. I don't want any co-author credit because I think your arguments are ludicrous, but I'd like to see them at least argued clearly.

Rachel McKinnon and Aryn Conrad appear to be desperate to get external validation for their lifestyle choices. I'm reminded at this point of Robert Pirsig's comment in "Zen and the Art of Motorcycle Maintenance":

You are never dedicated to something you have complete confidence in. No one is fanatically shouting that the sun is going to rise tomorrow. They know it's going to rise tomorrow. When people are fanatically dedicated to political or religious faiths or any other kinds of dogmas or goals, it's always because these dogmas or goals are in doubt.

[1] No, not really.