Your AI – whether it be Google's Bard, OpenAI's ChatGPT, or Microsoft's Bing – relies heavily on legal blogs and open law for its legal data.
Though these AI tools are built on large language models (LLMs), little is known about the data on which those models are trained.
Ambrogi reports, though, that The Washington Post has "lifted the cover off this black box."
Working with the Allen Institute for AI, the Post analyzed Google's C4 data set, "a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs," including Google's T5 and Facebook's LLaMA.
It then categorized all of those websites (journalism, entertainment, etc.) and ranked them based on how many "tokens" appeared from each, with tokens being the bits of text used to process the disorganized information.
Ambrogi found his own blog, LawSites, ranked 63,769th of all the sites used to train the dataset.
Based on searches for words such as law, legal, court, and case, Ambrogi found a number of prominent legal blogs.
It's interesting that, though FindLaw, Justia, and Casetext were at the top of the list, Thomson Reuters (ranked 175,911) and Bloomberg (ranked 11,209,960) were near the bottom of Ambrogi's list of eighteen law sites used to train the dataset.
Tells me that open law, along with insight and commentary on the law from established authorities, may well be front and center in the law delivered by AI.