Preprocessing language+mathematics corpora for pretraining #59
Comments
Maybe add: proofwiki?
Thanks @holtzermann17, interesting suggestion. I will also "audit" that resource. I've added the link to their latest XML dump in the issue description and am downloading it now. I'll be doing a gradual, step-by-step audit, preprocessing and packaging into pretrain-friendly versions over the coming weeks. So please feel free (you and anyone else reading here!) to link me to any other large-ish resources with math syntax that feel similar to the list above. And thanks!
Here is an unfortunate example of how s2orc deals with mathematical formulas, a fundamental limitation of scraping PDFs. You'll see sub- and superscripts silently put on the baseline, paragraphs badly split wherever there is display math, and of course missing markup. A non-starter in my eyes. Obtained from their sample.jsonl file.
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"The choice of the function S(x, t) depends on the weights and sup x\u2208\u2126 u 0 (x).",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"Proof. Let S = p + q with p and q such that",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"where r = |x| = x 2 1 + x 2 2 1/2 . The positive constants A, a and \u03ba will be chosen later. Then p and q satisfy",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"If a satisfies ",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"Then, by the choice of \u03ba, k and a, we see that S = p + q satisfies",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"and by the inequality (2.6), we obtain for z \u2208 \u2202\u2126,",
"cite_spans":[],"ref_spans":[]},
{
"section":"S(z, t) = p(z, t) + q(z, t) =",
"text":"A e \u03bat + e \u03bat e \u22122a > Ae \u03bat > f (z, \u00b7) 2 ( p(\u00b7, t) 2 + q(\u00b7, t) 2 ) \u2265 \u2126 f(z, y)(p + q) dy = \u2126 f (z, y",
"cite_spans":[],"ref_spans":[]},
{
"section":")S(y, t) dy, 0 < t < T.",
"text":"Note that above inequalities hold for arbitrary positive constant A. After choosing a and \u03ba, let A satisfy 2Ae \u2212a > sup x\u2208\u2126 u 0 (x). Then one has",
"cite_spans":[],"ref_spans":[]},
{
"section":")S(y, t) dy, 0 < t < T.",
"text":"Hence S(x, t) is a supersolution to (1.1), and thus inequality (2.3) holds by Theorem 1.1.",
"cite_spans":[],"ref_spans":[]},
{
"section":")S(y, t) dy, 0 < t < T.",
"text":"By Theorem 2.1, we have a supersolution for any T > 0. Hence the local solution u on D T from Theorem 1.2 is bounded in D T for arbitrary T > 0, and thus u can be extended to the whole time domain.",
"cite_spans":[],"ref_spans":[]},
{
"section":"Decreasing property of boundary values.",
"text":"In this section, boundary behavior of the solution to Problem (1.1) is studied. The difference of largest and smallest boundary values grows exponentially (inequality (3.4) ). When the weights are identically zero on some part of boundary, it is shown that the difference can be nonincreasing in Theorem 3.2.",
"cite_spans":[],"ref_spans":[]}
Preprocessing fine-print for approaching the arXMLiv 2020 and Stackexchange (kiwix 2020) sets. The usual path has been:
Pretraining an LM is proudly robust to noise, so there shouldn't be much need for overthinking the details here. Getting a sane basic setup that boosts the core textual English + math syntax signal should be a priority for our purposes. A range of other content can get dropped. Mostly asking the questions:
A good 2020 use of llamapun would be to use it as a unified preprocessing step for a variety of HTML corpora which also include math syntax by one trick or another. The goal would be to do the legwork on a variety of HTML dialects so that we get clean and maximally denoised data as a plaintext target, with a primary focus on using that textual form for pretraining a neural language model. I will be using this issue as a documentation placeholder for the various targets I have in mind.
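To make the "HTML with math syntax to plaintext" target concrete, here is a minimal sketch in Python. It is not llamapun (which is a Rust library) and does not reflect its API. It assumes the math source is carried either in an `alttext` attribute on `<math>` elements or in the `alt` attribute of math images, and the names `MathAwareTextExtractor` and `html_to_plaintext` are hypothetical.

```python
from html.parser import HTMLParser

class MathAwareTextExtractor(HTMLParser):
    """Collect visible text, replacing math with its attribute source."""

    SKIP = {"script", "style"}  # content that is never prose

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.math_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "math":
            # Assumption: the TeX-like source lives in `alttext`.
            if self.math_depth == 0 and attrs.get("alttext"):
                self.parts.append(attrs["alttext"])
            self.math_depth += 1
        elif tag == "img" and self.math_depth == 0 and attrs.get("alt"):
            # Assumption: images with alt text are math images.
            self.parts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "math" and self.math_depth:
            self.math_depth -= 1

    def handle_data(self, data):
        # Drop MathML internals and script/style bodies; keep prose.
        if self.math_depth == 0 and self.skip_depth == 0:
            self.parts.append(data)

def html_to_plaintext(html):
    parser = MathAwareTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs into single spaces.
    return " ".join("".join(parser.parts).split())

example = r'<p>Let <math alttext="x^2+y^2"><mi>x</mi></math> be small, with <img alt="\alpha" src="a.png"> fixed.</p>'
print(html_to_plaintext(example))
# prints: Let x^2+y^2 be small, with \alpha fixed.
```

A real pass would need per-dialect rules, for example to tell math images apart from ordinary figures, which this sketch does not attempt.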
What are we looking for?
I will re-edit this description to include corpora I think I can include for the current pass.
Decided to include (data has been obtained locally, checked when preprocessing is completed):
`alttext` attr) - non-Math sets also included: {academia, chess, ebooks, english, history, law, linguistics, literature, money, patents, philosophy, writers}
`latest.xml`, as it is already standard HTML
`alt=` attributes in math images, needed to lemmatize.
To vet:
Vetted, but currently excluded:
`=` in the texts, 16,000 have a `+`. So may be worth including as an extra source for light inline math.
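A tally like the `=` / `+` one can be reproduced with a few lines of Python. This is a sketch under an assumption about what was counted: texts that contain `=`, and among those, texts that also contain `+`. The function name `count_math_signals` is hypothetical.

```python
def count_math_signals(texts):
    """Return (texts containing '=', of those, texts also containing '+')."""
    with_equals = [t for t in texts if "=" in t]
    with_plus = sum(1 for t in with_equals if "+" in t)
    return len(with_equals), with_plus

texts = ["a = b + c", "x = 1", "plain prose"]
print(count_math_signals(texts))
# prints: (2, 1)
```

Plain character counts like this will also match `=` and `+` in non-mathematical contexts, so they only give a coarse estimate of how much light inline math a set contains.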