Preprocessing language+mathematics corpora for pretraining #59

dginev · 2020-11-25T14:36:06Z

A good 2020 use of llamapun would be to use it as a unified preprocessing step for a variety of HTML corpora which also include math syntax by one trick or another. The goal would be to do the legwork on a variety of HTML dialects so that we get clean and maximally denoised data as a plaintext target, with a primary focus on using that textual form for pretraining a neural language model. I will be using this issue as a documentation placeholder for the various targets I have in mind.

What are we looking for?

primary textual sources (rather than remixed/curated train sets from other experiments)
openly available for download & research
processable math syntax that we can reliably normalize and lexematize
interesting exceptions: e.g. synthetic datasets that offer diverse examples of math syntax use and/or problem posing

I will re-edit this description to list the corpora I think I can cover in the current pass.

Decided to include (data has been obtained locally, checked when preprocessing is completed):

To vet:

LearningQ
Berkeley MATH + AMPS dataset
openstax
wikia math & physics problems
science blogs: sciencealert, sciencebuddies, symmetry magazine, ...
educational texts - Introduction to Proofs

Vetted, but currently excluded:

holtzermann17 · 2020-11-28T19:41:45Z

Maybe add: proofwiki?

dginev · 2020-11-28T19:57:03Z

Thanks @holtzermann17, interesting suggestion. I will also "audit" that resource. I've added the link to their latest XML dump in the issue description and am downloading it now. I'll be doing gradual, step-by-step auditing, preprocessing and packaging into pretraining-friendly versions in the coming weeks.

So please feel free (and anyone else reading here!) to link me to any other large-ish resources with math syntax that feel similar to the list above. And thanks!

dginev · 2020-11-29T02:59:51Z

Here is an unfortunate example of how s2orc deals with mathematical formulas, a fundamental limitation of scraping PDFs. You'll see scripts silently put on the baseline, badly tokenized paragraphs when there is display math, and of course missing markup. A non-starter in my eyes.

Obtained from their sample.jsonl file.

{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"The choice of the function S(x, t) depends on the weights and sup x\u2208\u2126 u 0 (x).",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"Proof. Let S = p + q with p and q such that",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"where r = |x| = x 2 1 + x 2 2 1/2 . The positive constants A, a and \u03ba will be chosen later. Then p and q satisfy",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"If a satisfies ",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"Then, by the choice of \u03ba, k and a, we see that S = p + q satisfies",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"and by the inequality (2.6), we obtain for z \u2208 \u2202\u2126,",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"S(z, t) = p(z, t) + q(z, t) =",
   "text":"A e \u03bat + e \u03bat e \u22122a > Ae \u03bat > f (z, \u00b7) 2 ( p(\u00b7, t) 2 + q(\u00b7, t) 2 ) \u2265 \u2126 f(z, y)(p + q) dy = \u2126 f (z, y",
   "cite_spans":[],"ref_spans":[]},
{
   "section":")S(y, t) dy, 0 < t < T.",
   "text":"Note that above inequalities hold for arbitrary positive constant A. After choosing a and \u03ba, let A satisfy 2Ae \u2212a > sup x\u2208\u2126 u 0 (x). Then one has",
   "cite_spans":[],"ref_spans":[]},
{
   "section":")S(y, t) dy, 0 < t < T.",
   "text":"Hence S(x, t) is a supersolution to (1.1), and thus inequality (2.3) holds by Theorem 1.1.",
   "cite_spans":[],"ref_spans":[]},
{
   "section":")S(y, t) dy, 0 < t < T.",
   "text":"By Theorem 2.1, we have a supersolution for any T > 0. Hence the local solution u on D T from Theorem 1.2 is bounded in D T for arbitrary T > 0, and thus u can be extended to the whole time domain.",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Decreasing property of boundary values.",
   "text":"In this section, boundary behavior of the solution to Problem (1.1) is studied. The difference of largest and smallest boundary values grows exponentially (inequality (3.4) ). When the weights are identically zero on some part of boundary, it is shown that the difference can be nonincreasing in Theorem 3.2.",
   "cite_spans":[],"ref_spans":[]}

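The damage in the sample above can be roughly quantified. What follows is a minimal sketch, assuming the paragraphs are available as dicts with the `text` field shown above. The helper names `looks_truncated` and `truncated_fraction`, and the choice of sentence-final punctuation, are illustrative assumptions and not part of s2orc or llamapun. The heuristic flags paragraphs that stop without sentence-final punctuation, which is what the sample shows wherever a display formula was dropped.

```python
# Illustrative assumption: a paragraph that does not end in one of these
# characters was probably cut off where a display formula used to be.
SENTENCE_END = (".", "?", "!")


def looks_truncated(text: str) -> bool:
    """Return True if a non-empty paragraph lacks sentence-final punctuation."""
    stripped = text.rstrip()
    return bool(stripped) and not stripped.endswith(SENTENCE_END)


def truncated_fraction(records) -> float:
    """Fraction of paragraph records whose `text` field looks truncated."""
    texts = [record.get("text", "") for record in records]
    if not texts:
        return 0.0
    flagged = sum(1 for text in texts if looks_truncated(text))
    return flagged / len(texts)


if __name__ == "__main__":
    # Three paragraphs copied from the sample above.
    sample = [
        {"text": "Proof. Let S = p + q with p and q such that"},
        {"text": "If a satisfies "},
        {"text": "Hence S(x, t) is a supersolution to (1.1), and thus "
                 "inequality (2.3) holds by Theorem 1.1."},
    ]
    print(truncated_fraction(sample))
```

A check like this only catches the missing display math. The flattened scripts and lost markup inside the surviving text would need a different signal.
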
dginev · 2020-12-20T01:26:10Z

Preprocessing fine-print for approaching the arXMLiv 2020 and Stackexchange (kiwix 2020) sets. The usual path has been:

HTML -> plaintext -> tfrecords

Pretraining an LM is notably robust to noise, so there shouldn't be much need to overthink the details here. Getting a sane basic setup that boosts the core textual English + math syntax signal should be the priority for our purposes. A range of other content can be dropped.
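
The first arrow of the HTML -> plaintext -> tfrecords path can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not llamapun's implementation: it assumes each formula is a `<math>` element that carries its TeX source in an `alttext` attribute, substitutes that source for the formula, and drops `<script>` and `<style>` content as noise. The names `MathAwareText` and `html_to_plaintext` are hypothetical.

```python
from html.parser import HTMLParser


class MathAwareText(HTMLParser):
    """Collect visible text, replacing each <math> element by its alttext."""

    SKIPPED = {"script", "style"}  # assumed noise for this sketch

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.math_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "math":
            # Only the outermost <math> contributes its alttext.
            if self.math_depth == 0 and self.skip_depth == 0:
                alttext = dict(attrs).get("alttext")
                if alttext:
                    self.chunks.append(f" {alttext} ")
            self.math_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "math" and self.math_depth > 0:
            self.math_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text inside <math>, <script> or <style> is not emitted.
        if self.math_depth == 0 and self.skip_depth == 0:
            self.chunks.append(data)


def html_to_plaintext(html: str) -> str:
    """Return whitespace-normalized plaintext with formulas as TeX source."""
    parser = MathAwareText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    page = (
        '<p>Let <math alttext="x^{2}"><mi>x</mi><mn>2</mn></math> '
        "be positive.</p><script>var a = 1;</script>"
    )
    print(html_to_plaintext(page))
```

Keeping the formula as a single inline string leaves the normalization and lexematization of the math itself as a separate step, and the serialization to tfrecords is not shown here.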

Mostly asking the questions:

dginev · 2021-03-01T23:11:37Z

While looking into the proofwiki data, I'm cataloging some related links to further resources that I stumbled on:

Lists:

Resources:

... there's more to expand from here. I won't track them all down now, but it's encouraging to see there is such a wide variety of resources to compile data from, rich in math syntax.

Blog Roll

dginev added the enhancement label Nov 25, 2020

dginev mentioned this issue Jul 6, 2021

installation fails on Windows Subsystem for Linux #66

Closed

dginev mentioned this issue Sep 12, 2022

404 on SENNA links #72

Closed
