<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<!--<?xml-stylesheet type="text/xsl" href="article.xsl"?>-->
<article article-type="research-article" dtd-version="1.2" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id journal-id-type="issn">2767-0279</journal-id>
<journal-title-group>
<journal-title>Glossa Psycholinguistics</journal-title>
</journal-title-group>
<issn pub-type="epub">2767-0279</issn>
<publisher>
<publisher-name>eScholarship Publishing</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5070/G6011190</article-id>
<article-categories>
<subj-group>
<subject>Regular article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A-maze of Natural Stories: Comprehension and surprisal in the Maze task</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Boyce</surname>
<given-names>Veronica</given-names>
</name>
<email>vboyce@stanford.edu</email>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Levy</surname>
<given-names>Roger P.</given-names>
</name>
<email>rplevy@mit.edu</email>
</contrib>
</contrib-group>
<aff id="aff-1"><label>1</label>Stanford University, US</aff>
<aff id="aff-2"><label>2</label>Massachusetts Institute of Technology, US</aff>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2023-04-06">
<day>06</day>
<month>04</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>2</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>34</lpage>
<permissions>
<copyright-statement>Copyright: &#x00A9; 2023 The Author(s)</copyright-statement>
<copyright-year>2023</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See <uri xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</uri>.</license-p>
</license>
</permissions>
<self-uri xlink:href="https://glossapsycholinguistics.journalpub.escholarship.org/articles/10.5070/G6011190/"/>
<abstract>
<p>Behavioral measures of word-by-word reading time provide experimental evidence to test theories of language processing. A-maze is a recent method for measuring incremental sentence processing that can localize slowdowns related to syntactic ambiguities in individual sentences. We adapted A-maze for use on longer passages and tested it on the Natural Stories corpus. Participants were able to comprehend these longer text passages that they read via the Maze task. Moreover, the Maze task yielded useable reaction time data with word predictability effects that were linearly related to surprisal, the same pattern found with other incremental methods. Crucially, Maze reaction times show a tight relationship with properties of the current word, with little spillover of effects from previous words. This superior localization is an advantage of Maze compared with other methods. Overall, we expanded the scope of experimental materials, and thus theoretical questions, that can be studied with the Maze task.</p>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>1. Introduction</title>
<p>Two chief results of human language processing research are that comprehension is highly incremental and that comprehension difficulty is differential and localized. Incrementality in comprehension means that our minds do not wait for large stretches of linguistic input to accrue; rather, we eagerly analyze each moment of input and rapidly integrate it into context (<xref ref-type="bibr" rid="B47">Marslen-Wilson, 1975</xref>). Differential and localized processing difficulty means that different inputs in context present different processing demands during comprehension (<xref ref-type="bibr" rid="B38">Levy, 2008</xref>). Due to incrementality, these differential processing demands are, by and large, met relatively quickly by the mind once they are presented, and they can be measured in both brain (<xref ref-type="bibr" rid="B36">Kutas &amp; Hillyard, 1980</xref>; <xref ref-type="bibr" rid="B54">Osterhout &amp; Holcomb, 1992</xref>) and behavioral (<xref ref-type="bibr" rid="B51">Mitchell, 2004</xref>; <xref ref-type="bibr" rid="B59">Rayner, 1998</xref>) responses. These measurements often have low signal-to-noise ratio, and many methods require bringing participants into the lab and often require cumbersome equipment. However, they can provide considerable insight into how language processing unfolds in real time. Developing more sensitive methods that can easily be used with remote participants is thus of considerable interest.</p>
<p>Word-by-word reading or response times are among the most widely used behavioral measurements in language comprehension and give relatively direct insight into processing difficulty. The Maze task (<xref ref-type="bibr" rid="B20">Forster et al., 2009</xref>; <xref ref-type="bibr" rid="B22">Freedman &amp; Forster, 1985</xref>), which involves collecting participants&#8217; response times in a repeated two-alternative forced-choice between a word that fits the preceding linguistic context and a distractor that doesn&#8217;t, has recently been proposed as a high-sensitivity method that can easily be used remotely. Boyce et al. (<xref ref-type="bibr" rid="B8">2020</xref>) introduced several implementational innovations that made it easier for researchers to use Maze, and showed that for several controlled syntactic processing contrasts (<xref ref-type="bibr" rid="B77">Witzel et al., 2012</xref>) Maze offers better statistical power than self-paced reading, the other word-by-word response time method easy to use remotely. Maze has since had rapid uptake in the language processing community (<xref ref-type="bibr" rid="B12">Chac&#243;n et al., 2021</xref>; <xref ref-type="bibr" rid="B37">Levinson, 2022</xref>; <xref ref-type="bibr" rid="B43">Lieburg et al., 2022</xref>; <xref ref-type="bibr" rid="B53">Orth &amp; Yoshida, 2022</xref>; <xref ref-type="bibr" rid="B68">Ungerer, 2021</xref>).</p>
<p>There is increasing interest in collecting data during comprehension of more naturalistic materials such as stories and news articles (<xref ref-type="bibr" rid="B17">Demberg &amp; Keller, 2008</xref>; <xref ref-type="bibr" rid="B23">Futrell et al., 2020</xref>; <xref ref-type="bibr" rid="B45">Luke &amp; Christianson, 2016</xref>), which offer potentially improved ecological validity and larger scale data in comparison with repeated presentation of isolated sentences out of context. These more naturalistic materials require maintaining and integrating discourse dependencies and other types of information over longer stretches of time and linguistic material. Previous work leaves unclear whether the Maze task would be feasible for this purpose: the increased task demands might interfere with the demands presented by these more naturalistic materials, and vice versa. In this paper we report a new modification of the Maze task and show that it makes reading of extended, naturalistic texts feasible. We also analyze the resulting reaction time profiles and show that they provide strong signal regarding the probabilistic relationship between a word and the context in which it appears, and that the systematic linear relationship between word surprisal and response time observed in other reading paradigms (<xref ref-type="bibr" rid="B64">Smith &amp; Levy, 2013</xref>) also arises in the Maze task.</p>
<p>In the remainder of the introduction, we lay out the role of RT-based methods in theory testing, describe a few common methods, and review some key influences on reading time. We then proceed to present our modified &#8220;error-correction Maze&#8221; paradigm, our experiment, and the results of our analyses of the resulting data.</p>
<sec>
<title>1.1 Why measure RTs?</title>
<p>A major feature of human language processing is that not all sentences or utterances are equally easy to successfully comprehend. Sometimes this is mostly or entirely due to the linguistic structure of the sentence: for example, <italic>The rat that the cat that the dog chased killed ate the cheese</italic> is more difficult than <italic>The rat that was killed by the cat that was chased by the dog ate the cheese</italic> even though the meaning of the two sentences is (near-)identical. Sometimes the source of difficulty can be a mismatch between expectations set up by the context and the word choice in an utterance: for example, the question <italic>Is the cup red?</italic> may be confusing in a context containing more than one cup. Psycholinguistic theories may differ in their ability to predict what is easy and what is hard. One of the most powerful methods for studying these differential difficulty effects is to let the comprehender control the pace of presentation of the linguistic material, and to measure what she takes time on. For this purpose, taking measurements from experimental participants during reading, a widespread, highly practiced skill in diverse populations around the world, is of unparalleled value.</p>
<p>To a first approximation, everyday reading (when the reader&#8217;s goal is to understand a text&#8217;s overall content) is <italic>progressive</italic>: we read documents, paragraphs, and sentences from beginning to end. The reader encounters each word with the benefit of the preceding linguistic context. Incrementality in reading involves successively processing each word encountered and integrating it into the context. For a skilled reader experienced with the type of text being read, most words are easy enough that the subjective experience of reading the text is of smooth, continuously unfolding understanding as we construct a mental model of what is being described. But occasionally a word may be sufficiently surprising or otherwise difficult to reconcile with the context that it disrupts comprehension to the level of conscious awareness: in the sentence <italic>I take my coffee with cream and chamomile</italic>, for example, the last word is likely to do so. Behaviorally, this disruption typically manifests as a slowdown or longer <italic>reading time</italic> (RT) on the word itself, on the immediately following words, or in other forms such as regressive eye movements back to earlier parts of the text to check the context.</p>
<p>In fact, RTs and other measures that capture processing disruption also vary substantially with the difficulty of words in their context below the level of conscious awareness, with millisecond-scale differences in reading time between words. That is, the differential difficulty or processing load posed by various parts of a text is to a considerable extent <italic>localizable</italic> to specific words in their context. For this reason, RTs have proven a highly valuable measure for testing the predictions of psycholinguistic theory, from theories of character recognition to memory retrieval, parsing, and beyond.</p>
<p>For instance, competing theories about why certain types of object-extracted relative clauses, like <italic>the lawyer that the banker irritated</italic>, are harder to understand than the corresponding subject-extracted relative clauses, like <italic>the lawyer that irritated the banker</italic>, make different predictions about which words are the loci of the overall difficulty and slower RTs associated with object relatives (<xref ref-type="bibr" rid="B67">Traxler et al., 2002</xref>). Dependency-locality theory (<xref ref-type="bibr" rid="B27">Grodner &amp; Gibson, 2005</xref>) predicts that the locus of difficulty on object relatives is on the verb of the embedded clause (<italic>irritated</italic> in <italic>the lawyer that the banker irritated</italic>). In contrast, surprisal theory predicts the locus of difficulty will be on the article <italic>the</italic> at the start of the embedded clause (<xref ref-type="bibr" rid="B66">Staub, 2010</xref>; <xref ref-type="bibr" rid="B70">Vani et al., 2021</xref>). RT measures can potentially also inform theories about the time course of processing (i.e. which steps are parallel versus serial, Bartek et al. (<xref ref-type="bibr" rid="B5">2011</xref>)) or the functional form of relationships between word characteristics and processing time (<xref ref-type="bibr" rid="B64">Smith &amp; Levy, 2013</xref>).</p>
<p>Some of these theories rely on being able to attribute processing slowdowns to a particular word. Determining that object relatives are overall slower than subject relatives is easy: even an imprecise RT measure can show that the same set of words in a different order takes longer to read at the sentence level. However, many language processing theories make specific (and contrasting) predictions about which words in a sentence are harder to process. To adjudicate among these theories, we want methods that are <italic>well-localized</italic>, so that it is easy to determine which word is responsible for an observed RT slow-down. Ideally, a longer RT on a word would be an indication of that word&#8217;s increased difficulty, and not the lingering signal of a prior word&#8217;s increased difficulty. When the signal isn&#8217;t localized, advanced analysis techniques may be required to disentangle the slow-downs (<xref ref-type="bibr" rid="B62">Shain &amp; Schuler, 2018</xref>).</p>
</sec>
<sec>
<title>1.2 Eye-tracking and self-paced reading</title>
<p>The two most commonly used behavioral methods for studying incremental language processing during reading are tracking eye movements and self-paced reading. While both of these have proven powerful and highly flexible, they both have important limitations as well.</p>
<p>In eye-tracking, participants read a text on a screen naturally, while their saccadic eye movements are recorded by a computer-connected camera that is calibrated so that the researcher can reconstruct with high precision where the participant&#8217;s gaze falls on the screen at all times (<xref ref-type="bibr" rid="B59">Rayner, 1998</xref>). These eye movements can be used to reconstruct various position-specific reading time measures such as <italic>gaze duration</italic> (the total amount of time the eyes spend on a word the first time it is fixated before saccading to a later word) and <italic>total viewing time</italic> (the total amount of time that the word is fixated). If the eyes skip a word the first time it is approached from the left, the trial for that word is generally excluded. Eye tracking data collected with state-of-the-art high-precision recording equipment offers relatively good signal-to-noise ratio, but the difficulty presented by a word can still <italic>spill over</italic> into reading measures on subsequent words, a dynamic that can make it hard to isolate the source of an effect of potential theoretical interest (<xref ref-type="bibr" rid="B21">Frazier &amp; Rayner, 1982</xref>; <xref ref-type="bibr" rid="B40">Levy et al., 2009</xref>; <xref ref-type="bibr" rid="B60">Rayner et al., 2004</xref>). Short words such as articles and pronouns are often not fixated directly, which makes it harder to study the processing of these words with eye-tracking. Additionally, the equipment is expensive and data collection is laborious and must occur in the lab.</p>
<p>Self-paced reading (SPR) is a somewhat less natural paradigm in which the participant manually controls the visual presentation of the text by pressing a button (<xref ref-type="bibr" rid="B50">Mitchell, 1984</xref>). In its generally preferred variant, moving-window self-paced reading, words are revealed one at a time or one group at a time: every press of the button masks the currently presented word (group) and simultaneously reveals the next. The time spent between button presses is the unique RT measure for that word (group). Self-paced reading requires no special equipment and can be delivered remotely, but the measurements are noisier and even more prone to spillover (<xref ref-type="bibr" rid="B35">Koornneef &amp; van Berkum, 2006</xref>; <xref ref-type="bibr" rid="B46">MacDonald, 1993</xref>; <xref ref-type="bibr" rid="B64">Smith &amp; Levy, 2013</xref>).</p>
</sec>
<sec>
<title>1.3 Maze</title>
<p>The Maze task is an alternative method that is designed to increase localization at the expense of naturalness (<xref ref-type="bibr" rid="B20">Forster et al., 2009</xref>; <xref ref-type="bibr" rid="B22">Freedman &amp; Forster, 1985</xref>). In the Maze task, participants must repeatedly choose between two simultaneously presented options: a correct word that continues the sentence, and a distractor string which does not. Participants must choose the correct word, and their time to selection is treated as the reaction time, or RT. (We deliberately overload the abbreviation &#8220;RT&#8221; and use it for Maze reaction times as well as reading times from eye tracking and SPR, because the desirable properties of reading times turn out to hold for Maze reaction times as well.) Forster et al. (<xref ref-type="bibr" rid="B20">2009</xref>) introduced two versions of the Maze task: lexical &#8220;L&#8221;-maze where the distractors are non-word strings, and grammatical &#8220;G&#8221;-maze where the distractors are real words that don&#8217;t fit with the context of the sentence. In theory, participants must fully integrate each word into the sentence in order to confidently select it, which may require mentally reparsing previous material in order to allow the integration and selection of a disambiguating word. Forster et al. (<xref ref-type="bibr" rid="B20">2009</xref>) call this need for full integration &#8220;forced incremental sentence processing&#8221; in their title (p. 163) to distinguish from other incremental processing methods where words can be passively read before later committing to a parse. This idea of strong localization is supported by studies finding strongly localized effects for G-maze (<xref ref-type="bibr" rid="B8">Boyce et al., 2020</xref>; <xref ref-type="bibr" rid="B77">Witzel et al., 2012</xref>).</p>
<p>The Maze task has less face validity than eye-tracking or even SPR; repeated forced-choice selection does not seem very similar to normal reading. Despite this, Forster et al. (<xref ref-type="bibr" rid="B20">2009</xref>) report that &#8220;At a phenomenological level, participants typically report that they feel as if they are reading the sentence relatively naturally and that the correct alternative seems to &#8216;leap out&#8217; at them, so that they do not have to inspect the incorrect alternative very carefully, if at all.&#8221; (p. 164). This suggests that the Maze task may rely on the same language processing facilities tapped into by other reading methods. Maze may thus not be the best paradigm for studying the process of normal reading itself, but it may be perfectly good, or even superior, for getting at the underlying language processing.</p>
<p>However, G-maze materials are effort-intensive to construct because of the need to select infelicitous words as distractors for each spot of each sentence. This burdensome preparation may explain why the Maze task was not widely adopted. Boyce et al. (<xref ref-type="bibr" rid="B8">2020</xref>) demonstrated a way to automatically generate Maze distractors by using language models from Natural Language Processing to find words that are high surprisal in the context of the target sentence, and thus likely to be judged infelicitous by human readers. Boyce et al. (<xref ref-type="bibr" rid="B8">2020</xref>) call Maze with automatically generated distractors A-maze. In a comparison, A-maze distractors yielded results similar to the hand-generated G-maze distractors from Witzel et al. (<xref ref-type="bibr" rid="B77">2012</xref>), and A-maze outperformed L-maze and an SPR control in detecting and localizing expected slowdown effects. Sloggett et al. (<xref ref-type="bibr" rid="B63">2020</xref>) also found that A-maze and G-maze distractors yielded similar results on a disambiguation paradigm.</p>
<p>Another recent variant of the Maze task is interpolated I-maze, which uses a mix of real word distractors (generated via the A-maze process) and non-word distractors (<xref ref-type="bibr" rid="B70">Vani et al., 2021</xref>; <xref ref-type="bibr" rid="B75">Wilcox et al., 2021</xref>). The presence of real word distractors encourages close attention to the sentential context, while non-words can be used as distractors where the word in the sentence is itself ungrammatical or highly unexpected, and/or it is important that the predictability of the distractor in the context is perfectly well-balanced (at zero) across all experimental conditions.</p>
</sec>
<sec>
<title>1.4 Measuring localization: Frequency, length, and surprisal effects</title>
<p>Localized measures can be used to attribute processing difficulty to individual words; however, determining whether a method is localized requires knowing how hard the words were to process. One approach is to look at properties of words that are known to influence reading times across methods such as eye-tracking and SPR. Longer words and lower frequency words tend to take longer to process (<xref ref-type="bibr" rid="B33">Kliegl et al., 2004</xref>), as do less predictable words (<xref ref-type="bibr" rid="B60">Rayner et al., 2004</xref>).</p>
<p>A word can be unpredictable for a variety of reasons: it could be low frequency, semantically unexpected, the start of a low-frequency syntactic construction, or a word that disambiguates prior words to a less common parse. Many targeted effects of interest can thus be potentially accommodated theoretically as specific features that influence word predictability.<xref ref-type="fn" rid="n1">1</xref> Thus incremental processing methods that are sensitive to predictability are useful for testing linguistic theories that make predictions about what words are unexpected.</p>
<p>The overall predictability of a word in a context can be estimated using language models that are trained on large corpora of language to predict what word comes next in a sentence. A variety of pre-trained models exist, with varied internal architectures and training methods, but all of them generate measures of predictability. Predictability is often measured in bits of surprisal, the negative log probability of a word in its context (<xref ref-type="bibr" rid="B29">Hale, 2001</xref>; <xref ref-type="bibr" rid="B38">Levy, 2008</xref>). A word with 1 bit of surprisal is expected to occur half the time; a word with 2 bits, a quarter of the time; and so on.</p>
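<p>The arithmetic relating probability to bits of surprisal can be sketched directly (an illustrative Python snippet; <monospace>surprisal_bits</monospace> is a hypothetical helper, not part of any cited codebase):</p>
<preformat>
```python
import math

def surprisal_bits(p):
    """Surprisal, in bits, of a word whose contextual probability is p."""
    return -math.log2(p)

# A word predicted to occur half the time carries 1 bit of surprisal;
# one predicted a quarter of the time carries 2 bits.
print(surprisal_bits(0.5))   # 1.0
print(surprisal_bits(0.25))  # 2.0
```
</preformat>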
<p>The functional form of the relationship between RTs from eye-tracking and SPR studies and the predictability of the words is linear in terms of surprisal (<xref ref-type="bibr" rid="B26">Goodkind &amp; Bicknell, 2018</xref>; <xref ref-type="bibr" rid="B45">Luke &amp; Christianson, 2016</xref>; <xref ref-type="bibr" rid="B64">Smith &amp; Levy, 2013</xref>; <xref ref-type="bibr" rid="B74">Wilcox et al., 2020</xref>), even when two important context-invariant word features known to influence RTs, length and frequency, are controlled for. Predictability reliably correlates with reading time over a wide range of surprisals found in natural-sounding texts, not just for words that are extremely expected or unexpected (<xref ref-type="bibr" rid="B64">Smith &amp; Levy, 2013</xref>). If Maze RTs reflect the same processing as other methods, we expect to find a similar linear relationship with surprisal.</p>
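<p>To make the notion of a linear functional form concrete, a slope of RT on surprisal can be estimated by ordinary least squares (an illustrative pure-Python sketch with made-up numbers, not data from any cited study):</p>
<preformat>
```python
# Hypothetical per-word surprisals (bits) and RTs (ms) that are
# exactly linear in surprisal; OLS recovers the slope and intercept.
surprisals = [1.0, 2.5, 4.0, 5.5, 7.0, 8.5]
rts = [520.0, 545.0, 570.0, 595.0, 620.0, 645.0]

n = len(surprisals)
mean_s = sum(surprisals) / n
mean_rt = sum(rts) / n
num = sum((s - mean_s) * (r - mean_rt) for s, r in zip(surprisals, rts))
den = sum((s - mean_s) ** 2 for s in surprisals)
slope = num / den                      # ms of slowdown per bit of surprisal
intercept = mean_rt - slope * mean_s   # baseline RT at zero surprisal
print(slope, intercept)
```
</preformat>
<p>In real analyses the regression additionally controls for word length and frequency, as described above.</p>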
<p>The role of word frequency is worth special note. Although the facilitative effect of word frequency on reading measures has been known for decades, this effect remains a theoretical puzzle. Sensitivity to a word&#8217;s probability in context can be derived from any of a number of optimization principles (<xref ref-type="bibr" rid="B39">Levy, 2013</xref>; <xref ref-type="bibr" rid="B64">Smith &amp; Levy, 2013</xref>), and word frequency is a context-insensitive estimate of word probability. However, it is not straightforward why this effect should exist in models in which a role is played by <italic>contextually-conditioned</italic> word probability, which reading and other language processing measures are known to be sensitive to: conditioning on context should render the context-insensitive probability estimate irrelevant for optimal processing. Smith &amp; Levy (<xref ref-type="bibr" rid="B65">2008</xref>) noted this puzzle and posited two alternative explanations. First is estimation error of conditional word probability in computationally implemented language models used at the time, in which case the role of frequency could diminish or disappear as measurements of surprisal improve (as they have over the past decade); consistent with this hypothesis is the recent report of Shain (<xref ref-type="bibr" rid="B61">2019</xref>) finding no word frequency effects in reading using more recent language models. Second, the effects of frequency (in addition to those of surprisal) could be the result of quick, heuristic responses to the words based on the more easily available unigram frequency before more fine-grained context-sensitive surprisal becomes available within real-time language processing mechanisms.</p>
</sec>
<sec>
<title>1.5 Current experiment</title>
<p>The Maze task has thus far primarily been used on constructed sentences focusing on targeted effects and not on the long naturalistic passages used to assess the relationship between RT and surprisal. We tested how A-maze performs on longer naturalistic corpora and compared it with self-paced reading (SPR), with the following main questions in mind:</p>
<list list-type="order">
<list-item><p>Do participants engage with these longer passages successfully with the A-maze task?</p></list-item>
<list-item><p>Is A-maze as sensitive a method as SPR for these longer passages?</p></list-item>
<list-item><p>What is the functional form of the relationship between word surprisal and RT in the A-maze task?</p></list-item>
<list-item><p>Does A-maze have less spillover than SPR?</p></list-item>
<list-item><p>What types of context-driven expectations, as operationalized in competing computational language models, are deployed to determine RTs in A-maze and SPR?</p></list-item>
</list>
<p>We used the Natural Stories corpus (<xref ref-type="bibr" rid="B23">Futrell et al., 2020</xref>), a set of 10 passages designed to read fluently to native speakers. Each passage is roughly 1000 words long. The passages contain copious punctuation, quoted speech, proper nouns, and low frequency grammatical constructions. The corpus is accompanied by binary-choice comprehension questions, 6 per story, which we used to assess comprehension.</p>
<p>We tweaked the A-maze task to accommodate these longer passages and then had participants read the passages in the Maze. We compare our A-maze results with SPR data collected on the Natural Stories corpus by Futrell et al. (<xref ref-type="bibr" rid="B23">2020</xref>).</p>
</sec>
</sec>
<sec>
<title>2. Error-correction Maze</title>
<p>In order to support longer materials, we tweaked the Maze task, creating a new variant called error-correction Maze.</p>
<p>One of the benefits of the Maze task is that it forces incremental processing by having participants make an active choice about what the next word is. But what happens if they choose incorrectly? In the traditional Maze paradigm, any mistake ends the sentence, and the participant moves on to the next item (<xref ref-type="bibr" rid="B20">Forster et al., 2009</xref>). An advantage of this is that participants who contribute RT data are very likely to have understood the sentence up to that point. This contrasts with other methods, where determining whether participants are paying attention usually requires separate comprehension questions; such questions are typically not used with Maze.</p>
<p>However, terminating sentences on errors means that we don&#8217;t have RTs for words after a participant makes a mistake in an item. In traditional G-maze tasks, with hand-crafted distractors and attentive participants, errors are rare and data loss is a minor issue. However, data loss can be worse with A-maze materials and crowd-sourced participants (<xref ref-type="bibr" rid="B8">Boyce et al., 2020</xref>). The high error rates likely result from some combination of participants guessing randomly and of auto-generated distractors that in fact fit the sentence; as Boyce et al. (<xref ref-type="bibr" rid="B8">2020</xref>) noted, some distractors, especially early in the sentence, were problematic and caused considerable data loss.</p>
<p>Error rates could be reduced by auto-generating better distractors or hand-replacing problematic ones, but that does not solve the fundamental problem with long items: even with well-chosen distractors and attentive participants, errors compound over long materials. For instance, with a 1% per-word error rate, 86% of participants would complete a 15-word sentence without error, but only 61% would complete a 50-word vignette, and just 13% would complete a 200-word passage. In order to run longer materials, we needed something to do when participants made a mistake other than terminate the entire item.</p>
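<p>The compounding behind these completion figures can be checked directly (an illustrative Python calculation, not part of the experimental code):</p>
<preformat>
```python
# Probability that a participant completes an n-word item with no errors,
# assuming a fixed 1% per-word error rate (independent across words).
per_word_accuracy = 0.99
for n in (15, 50, 200):
    completion_rate = per_word_accuracy ** n
    print(f"{n}-word item: {completion_rate:.0%} complete it error-free")
# Reproduces the 86%, 61%, and 13% figures cited in the text.
```
</preformat>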
<p>As a solution, we introduce an <italic>error-correction</italic> variant of Maze shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. When a participant makes an error, they see an error message and must try again to select the correct option, before continuing the sentence as normal. We make error-correction Maze available as an option in a modification of the Ibex Maze implementation introduced in Boyce et al. (<xref ref-type="bibr" rid="B8">2020</xref>) (<ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/vboyce/Ibex-with-Maze">https://github.com/vboyce/Ibex-with-Maze</ext-link>). The code records both the RT to the first click and also the total RT until the correct answer is selected as separate values.</p>
<fig id="F1">
<caption>
<p><bold>Figure 1:</bold> Schematic of error-correction Maze. A participant reads a sentence word by word, choosing the correct word at each time point (selections marked in blue ovals). When they make a mistake, an error message is displayed, so the participant can try again and continue with the sentence.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossapx-2-1-190-g1.png"/>
</fig>
<p>Error-correction Maze expands the types of materials that can be used with Maze to include arbitrarily long passages and cushions the impact of occasional problematic distractors. Error-correction Maze is a change in experimental procedure, and is independent of what types of distractors are used. This error-correction presentation is used here with A-maze, but could also be used with G-maze or I-maze.</p>
</sec>
<sec sec-type="methods">
<title>3. Methods</title>
<p>We constructed A-maze distractors for the Natural Stories corpus (<xref ref-type="bibr" rid="B23">Futrell et al., 2020</xref>) and recruited 100 crowd-sourced participants to each read a story in the error-correction Maze paradigm. The materials, data, and analysis code are all available at <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/vboyce/natural-stories-maze">https://github.com/vboyce/natural-stories-maze</ext-link>.</p>
<sec>
<title>3.1 Materials</title>
<p>We used the texts from the Natural Stories corpus (<xref ref-type="bibr" rid="B23">Futrell et al., 2020</xref>) and their corresponding comprehension questions. To familiarize participants with the task, we wrote a short practice passage and corresponding comprehension questions. See Appendix A for an excerpt of one of the stories.</p>
<p>To generate distractors, we first split the corpus into sentences and then ran the sentences through the A-maze generation process. We used an updated version of the codebase from Boyce et al. (<xref ref-type="bibr" rid="B8">2020</xref>), which can handle the greater variety of punctuation present in the Natural Stories corpus (updated auto-generation code at <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/vboyce/Maze">https://github.com/vboyce/Maze</ext-link>). We took the auto-generated distractors as they were, without checking for quality.</p>
</sec>
<sec>
<title>3.2 Participants</title>
<p>We recruited 100 participants from Amazon Mechanical Turk in April 2020, and paid each participant $3.50 for roughly 20 minutes of work. We excluded data from those who did not report English as their native language, leaving 95 participants. After examining participants&#8217; performance on the task (see results for details), we excluded data from participants with less than 80% accuracy, removing participants whose behavior was consistent with random guessing. After this exclusion, 63 participants were left.</p>
</sec>
<sec>
<title>3.3 Procedure</title>
<p>Participants first gave their informed consent and saw task instructions. Then they read a short practice story in the Maze paradigm and answered 2 binary-choice practice comprehension questions, before reading one main story in the error-correction A-maze task. After the story, they answered 6 comprehension questions, commented on their experience, answered optional demographic questions, were debriefed, and were given a code to enter for payment. The experiment was implemented in Ibex (<ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/addrummond/ibex">https://github.com/addrummond/ibex</ext-link>).</p>
</sec>
<sec>
<title>3.4 Self-paced reading comparison</title>
<p>In addition to the texts, Futrell et al. (<xref ref-type="bibr" rid="B23">2020</xref>) released reading time data from an SPR study they ran in 2011. They recruited 181 participants from Amazon Mechanical Turk, most of whom read 5 of the stories. After reading each story, each participant answered 6 binary-choice comprehension questions. For comparability with A-maze, we analyze only the first story each participant read, and, in line with Futrell et al. (<xref ref-type="bibr" rid="B23">2020</xref>), exclude participants who got fewer than 5 of the 6 comprehension questions correct, leaving 165 SPR participants.</p>
</sec>
<sec>
<title>3.5 SPR&#8211;Maze correlation</title>
<p>We compared the correlations between the Maze and SPR RTs to within-Maze and within-SPR correlations. For Maze, within each story, we split the data in half, randomly assigning subjects into two equal groups. Within each half, we calculated a per-word average RT for each word and then a per-sentence average RT across word averages. We calculated a within-Maze correlation between these two halves.</p>
<p>For this comparison, we downsampled the SPR data, choosing a number of participants equal to the number we had for Maze, to avoid differences due to dataset size. We then used the same split-half procedure to get a within-SPR correlation. For the between-method (Maze&#8211;SPR) correlation, we took the average correlation across the 4 pairs of Maze half and SPR half.</p>
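<p>The split-half procedure can be sketched as follows (an illustrative Python translation of the logic; the authors&#8217; analyses were conducted in R, and the data structure and function names here are hypothetical):</p>

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_correlation(rts, seed=0):
    """rts maps (sentence_id, word_id) -> {subject_id: RT in ms}.
    Randomly split subjects into two equal groups, compute per-word and
    then per-sentence mean RTs within each half, and correlate halves."""
    subjects = sorted({s for by_subj in rts.values() for s in by_subj})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    half_a = set(subjects[: len(subjects) // 2])

    def sentence_means(in_half_a):
        word_means = {}  # sentence_id -> list of per-word mean RTs
        for (sent, _word), by_subj in sorted(rts.items()):
            vals = [rt for s, rt in by_subj.items() if (s in half_a) == in_half_a]
            if vals:
                word_means.setdefault(sent, []).append(mean(vals))
        return {sent: mean(ms) for sent, ms in word_means.items()}

    a, b = sentence_means(True), sentence_means(False)
    shared = sorted(set(a) & set(b))
    return pearson([a[s] for s in shared], [b[s] for s in shared])
```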
</sec>
<sec>
<title>3.6 Modeling approach</title>
<p>Our analytic questions required multiple modeling approaches. To look at the functional form of the relationship between surprisal and RT data, we fit Generalized Additive Models (GAMs) to allow for non-linear relationships (<xref ref-type="bibr" rid="B81">Wood, 2017</xref>). GAM model summaries can be harder to interpret than those for linear models, so to measure effect sizes and assess spillover, we used linear mixed models. Finally, in order to determine which language model best predicts the RT data, we fit additional linear models with predictors from multiple language models to look at their relative contributions. All these models used surprisal, frequency, and length as predictors for RT. We considered these predictors from both the current and past word to account for the possibility of spillover effects in A-maze. For SPR comparisons, we included predictors from the current and past three words to account for known spillover effects. We conducted data processing and analyses using R Version 4.2.2 (<xref ref-type="bibr" rid="B57">R Core Team, 2022</xref>).<xref ref-type="fn" rid="n2">2</xref></p>
<sec>
<title>3.6.1 Predictors</title>
<p>We created a set of predictor variables of frequency, word length, and surprisals from 4 language models. For length, we used the length in characters excluding end punctuation. For unigram frequency, we tokenized the training data from Gulordava et al. (<xref ref-type="bibr" rid="B28">2018</xref>) and tallied up instances. We then rescaled the word counts to get the log2 frequency of occurrences per 1 billion words, so higher values indicate higher log frequencies. We got per-word surprisals for each of 4 different language models, covering a range of common architectures: a Kneser-Ney smoothed 5-gram; the long short-term memory recurrent neural network model of Gulordava et al. (<xref ref-type="bibr" rid="B28">2018</xref>), which we refer to as GRNN; Transformer-XL (<xref ref-type="bibr" rid="B16">Dai et al., 2019</xref>); and GPT-2 (<xref ref-type="bibr" rid="B58">Radford et al., n.d.</xref>), using lm-zoo (<xref ref-type="bibr" rid="B25">Gauthier et al., 2020</xref>). For all of these predictors, we used both the predictor at the current word as well as lagged predictors from the previous word.</p>
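<p>The frequency rescaling amounts to converting a raw corpus count into a log2 rate of occurrence per billion words (a toy Python illustration; the counts are hypothetical):</p>

```python
import math

def log2_freq_per_billion(count, total_tokens):
    """Rescale a raw corpus count to log2 occurrences per 1 billion
    words; higher values indicate higher log frequencies."""
    return math.log2(count / total_tokens * 1_000_000_000)

# A word seen 90 times in a 90-million-token corpus occurs 1000 times
# per billion words, so its value is log2(1000), roughly 9.97 bits.
# Doubling the count raises the value by exactly 1.
```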
</sec>
<sec>
<title>3.6.2 Exclusions</title>
<p>In the Maze task, the first word of every sentence is paired with a nonce (x-x-x) distractor rather than a real word, as there is no preceding context with which to distinguish a real word from a distractor; due to this difference, we excluded the first word of every sentence, leaving 9782 words. We excluded words for which we did not have surprisal or frequency information, leaving 8489 words. We additionally excluded words that any model treated as being composed of multiple tokens (primarily words with punctuation), leaving 7512 words.<xref ref-type="fn" rid="n3">3</xref> We excluded outlier RTs that were &lt;100 or &gt;5000 ms (&lt;100 ms is likely a recording error; &gt;5000 ms likely reflects the participant getting distracted). We also excluded RTs from words where mistakes occurred or which occurred after a mistake in the same sentence. We analyzed only words where we had values for all predictors, which meant that if the previous word was unknown to a model, the word was excluded because of missing values for a lagged predictor.</p>
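<p>The RT exclusion criteria can be summarized as a simple filter (an illustrative Python sketch; the function and field names are hypothetical):</p>

```python
def keep_rt(rt_ms, is_error, after_error_in_sentence, has_all_predictors):
    """Apply the RT exclusions described above: drop outliers (<100 ms,
    likely a recording error; >5000 ms, likely distraction), words at or
    after a mistake in the same sentence, and words missing any current
    or lagged predictor value."""
    if is_error or after_error_in_sentence or not has_all_predictors:
        return False
    return 100 <= rt_ms <= 5000
```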
</sec>
<sec>
<title>3.6.3 Model specification</title>
<p>To infer the shape of the relationship between our predictor variables and RTs, we fit generalized additive models (GAMs) using R&#8217;s <monospace>mgcv</monospace> package to predict the mean RT (after exclusions) for each word, averaging across participants from whom we obtained an unexcluded RT for that word. We centered but did not rescale the length and frequency predictors, and left surprisal uncentered for interpretability. We used smooth terms (<monospace>mgcv</monospace>&#8217;s <monospace>s()</monospace>) for surprisal and tensor product terms (<monospace>mgcv</monospace>&#8217;s <monospace>ti()</monospace>) for frequency-by-length effects and interactions. We used restricted maximum likelihood (REML) for smoothing parameter estimation. To more fully account for the uncertainty in the smoothing parameter estimates, we fit 101 bootstrap replicates of each GAM model; in <xref ref-type="fig" rid="F4">Figures 4</xref> and <xref ref-type="fig" rid="F5">5</xref>, the best-fit lines derive from the mean estimated effect size across the bootstrap replicates, and the shaded areas indicate a 95% bootstrap confidence interval on this effect size (the boundaries are the 2.5% and 97.5% quantiles of the bootstrapped replicates).</p>
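<p>The percentile bootstrap interval used for the shaded bands can be sketched generically (a Python illustration of the 2.5%/97.5% quantile construction over replicate fits, not the authors&#8217; <monospace>mgcv</monospace> code; <monospace>fit_fn</monospace> stands in for refitting the model):</p>

```python
import random
from statistics import mean

def bootstrap_ci(fit_fn, data, n_reps=101, alpha=0.05, seed=0):
    """Refit a statistic on n_reps resamples (with replacement) of the
    data and return (mean estimate, lower, upper), where the bounds are
    the alpha/2 and 1 - alpha/2 quantiles of the replicate estimates."""
    rng = random.Random(seed)
    estimates = sorted(
        fit_fn([rng.choice(data) for _ in data]) for _ in range(n_reps)
    )
    lo = estimates[round((alpha / 2) * (n_reps - 1))]
    hi = estimates[round((1 - alpha / 2) * (n_reps - 1))]
    return mean(estimates), lo, hi
```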
<p>For linear models, we centered all predictors. We modeled the main effects of surprisal, length, and frequency as well as surprisal-by-length and frequency-by-length interactions. For the A-maze data, we used maximal mixed effects, including by-subject slopes and a per-word-token random intercept (<xref ref-type="bibr" rid="B4">Barr et al., 2013</xref>). We used weak priors (normal(1000,1000) for intercept, normal(0,500) for beta and sd, and lkj(1) for correlations) and ran models with brm (<xref ref-type="bibr" rid="B10">B&#252;rkner, 2018</xref>).</p>
<p>For linear models of the SPR data, we were unable to fit a single model whose random effects structure was maximal with respect to all fixed-effects predictors. We report results for the best (in terms of having maximal random effects structure with respect to fixed effects of primary theoretical interest) single model we could fit: by-subject random intercept, uncorrelated by-subject random slopes for surprisal, length and frequency, and a per-word-token random intercept, fit with lme4 (<xref ref-type="bibr" rid="B6">Bates et al., 2015</xref>), as this model specification did not fit reliably in brm.</p>
<p>For model comparisons, we took by-item averaged data to aid fast model fitting. We included frequency, length, and their interaction in all models. We then fit simple linear regression models (using R&#8217;s <monospace>lm()</monospace>) with either 1 or 2 sources of surprisal and assessed the effect of adding the second surprisal source with an F test (using R&#8217;s <monospace>anova()</monospace>).</p>
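<p>The nested comparison reduces to a standard F test on residual sums of squares; a minimal Python sketch of the statistic (R&#8217;s <monospace>anova()</monospace> computes this, and the corresponding p value, automatically):</p>

```python
def nested_f_stat(rss_reduced, rss_full, p_reduced, p_full, n):
    """F statistic for comparing nested OLS models fit to n points: the
    reduced model has p_reduced parameters and residual sum of squares
    rss_reduced; the full model adds p_full - p_reduced predictors. The
    statistic is referred to an F(p_full - p_reduced, n - p_full)
    distribution to obtain the p value."""
    df_num = p_full - p_reduced
    df_den = n - p_full
    f = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f, df_num, df_den
```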
</sec>
</sec>
</sec>
<sec>
<title>4. Results</title>
<sec>
<title>4.1 Do participants engage successfully?</title>
<p>Our first question was whether participants could engage successfully with the error-correction Maze task. We assessed engagement by looking at participants&#8217; accuracy on the Maze task and performance on the comprehension questions.</p>
<p>Accuracy, or how often a participant chose the correct word over the distractor, reflects both the quality of the distractors and the focus and skill of the participant. We calculated the per-word accuracy rate for each participant and compared it against their average reaction time.<xref ref-type="fn" rid="n4">4</xref> As seen in <xref ref-type="fig" rid="F2">Figure 2A</xref>, one cluster of participants (marked in green) made relatively few errors, with some reaching 99% accuracy. This high performance confirms that the distractors were generally appropriate and shows that some participants maintained focus on the task for the whole story. These careful participants took around 1 second for each word selection, which is much slower than in eye-tracking or SPR.</p>
<fig id="F2">
<caption>
<p><bold>Figure 2:</bold> A. Each participant&#8217;s accuracy on the Maze task (fraction of words selected correctly) versus their average reaction time (in ms). Many participants (marked in green) chose the correct word &gt;80% of the time; others (in red) appear to be randomly guessing. B. Performance on the comprehension questions. Participants with low task accuracy performed poorly on comprehension questions; participants with &gt;80% task accuracy tended to do well, with performance roughly comparable to that of SPR participants from Futrell et al. (<xref ref-type="bibr" rid="B23">2020</xref>) on their first stories.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossapx-2-1-190-g2.png"/>
</fig>
<p>Another cluster of participants (in red) sped through the task, seemingly clicking randomly. This bimodal distribution is likely due to the mix of workers on Mechanical Turk, as we did not use qualification cutoffs. We believe the high level of random guessing is an artifact of the subject population (<xref ref-type="bibr" rid="B30">Hauser et al., 2018</xref>), and we expect that following current recommendations for participant recruitment, such as using qualification cutoffs or another recruitment site, would result in fewer participants answering randomly (<xref ref-type="bibr" rid="B18">Eyal et al., 2021</xref>; <xref ref-type="bibr" rid="B56">Peer et al., 2017</xref>).</p>
<p>To determine comprehension accuracy, we counted how many of the binary-choice comprehension questions each participant got right (out of 6). As seen in <xref ref-type="fig" rid="F2">Figure 2B</xref>, most participants who were accurate on the task also did well on comprehension questions, while participants who were at chance on the task were also at chance on the comprehension questions. Participants usually answered quickly (within 10 seconds), so we do not believe they were looking up the answers on the Internet. We can&#8217;t rule out that some participants may have been able to guess the answers without understanding the story. Nonetheless, the accurate answers provide preliminary evidence that people can understand and remember details of stories they read during the Maze task.</p>
<p>The comprehension question performance of accurate Maze participants is broadly similar to the performance of SPR participants from Futrell et al. (<xref ref-type="bibr" rid="B23">2020</xref>) on the first story read. Overall, 60% of Maze participants got 5 or 6 questions right (22% of low-accuracy participants and 79% of high-accuracy participants), compared to 91% of all SPR reads and 83% of first SPR reads. These differences cannot necessarily be attributed to the methods, as the participant populations differed. While both studies were conducted on Mturk, the quality of Mturk data decreased between 2011, when the SPR data were collected, and 2020, when the A-maze data were collected (<xref ref-type="bibr" rid="B14">Chmielewski &amp; Kucker, 2020</xref>).</p>
<p>For the remainder of the analyses, we use task performance as our exclusion metric for A-maze because it is more fine-grained, analyzing only data from participants with at least 80% accuracy (in the gap between high performers and low performers). For the SPR comparison, we follow Futrell et al. (<xref ref-type="bibr" rid="B23">2020</xref>)&#8217;s criteria and exclude participants who got fewer than 5 of the comprehension questions correct.</p>
</sec>
<sec>
<title>4.2 How do A-maze and SPR compare in sensitivity?</title>
<p>Our second question concerned the relative sensitivity (signal-to-noise ratio) of A-maze and SPR. To assess sensitivity, we conducted split-half comparisons, looking at the correlations between and within SPR and A-maze. If the methods picked up on the same effects, we would expect them to be correlated, with sentences that took longer to read in one method also taking longer in the other. We calculated the average RT at the sentence level to reduce variability from spillover patterns. The correlation between Maze and SPR was 0.25, compared to 0.23 within SPR and 0.36 within Maze. See <xref ref-type="fig" rid="F3">Figure 3</xref> for a visual comparison of overall Maze versus SPR RTs. SPR data are about as correlated with Maze as with another sample of SPR data, which provides some evidence that Maze and SPR are measuring the same effects. The superior within-method split-half correlation for Maze relative to SPR, despite the smaller number of participants, suggests that Maze is the more sensitive of the two methods (higher signal-to-noise ratio), consistent with the findings of Boyce et al. (<xref ref-type="bibr" rid="B8">2020</xref>) for factorial experimental designs with isolated-sentence presentation.</p>
<fig id="F3">
<caption>
<p><bold>Figure 3:</bold> Correlation between SPR and Maze data. RTs (measured in milliseconds) were averaged across participants per word and then averaged together within each sentence, so that each point represents the average RT in the two methods for one sentence in the corpus. Presented on a fixed scale coordinate system where 1 millisecond of RT takes equal physical space on both axes. Line and confidence interval reflect best linear fit regression of SPR time against Maze time.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossapx-2-1-190-g3.png"/>
</fig>
<fig id="F4">
<caption>
<p><bold>Figure 4:</bold> GAM results for the effect of current word surprisal (top) or previous word surprisal (bottom) on Maze reaction time (RT). Density of data is shown along the x-axis. The best-fit lines are from the mean estimated effect size across the bootstrap replicates, and the shaded areas indicate a 95% bootstrap confidence interval on this effect size. For each of the 4 language models used, there is a linear relationship between current word surprisal and RT. The relationship between previous word surprisal and RT is much flatter.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossapx-2-1-190-g4.png"/>
</fig>
<fig id="F5">
<caption>
<p><bold>Figure 5:</bold> GAM results for the effect of current word surprisal (top) or the surprisal of an earlier word, up to 3 words back on SPR RT data (<xref ref-type="bibr" rid="B23">Futrell et al., 2020</xref>). Density of data is shown along the x-axis. The best-fit lines are from the mean estimated effect size across the bootstrap replicates, and the shaded areas indicate a 95% bootstrap confidence interval on this effect size.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossapx-2-1-190-g5.png"/>
</fig>
</sec>
<sec>
<title>4.3 Are the effects of surprisal linear?</title>
<p>We next considered the relationship between surprisal and Maze RT. Surprisal, a measure of overall word predictability in context, is linearly related to RT in eye-tracking and SPR (<xref ref-type="bibr" rid="B26">Goodkind &amp; Bicknell, 2018</xref>; <xref ref-type="bibr" rid="B45">Luke &amp; Christianson, 2016</xref>; <xref ref-type="bibr" rid="B64">Smith &amp; Levy, 2013</xref>; <xref ref-type="bibr" rid="B74">Wilcox et al., 2020</xref>). If Maze is measuring the same language processes, we would expect to see a linear relationship between surprisal and Maze RT.</p>
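<p>Surprisal is the negative log probability of a word given its preceding context; under a linear linking function, each additional bit of surprisal adds a constant amount of reading time (a toy Python illustration of the definition):</p>

```python
import math

def surprisal_bits(p_word_given_context):
    """Surprisal in bits: -log2 P(word | context). A word with
    probability 0.25 in context carries 2 bits of surprisal; halving
    the probability adds exactly one bit."""
    return -math.log2(p_word_given_context)
```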
<p>To assess the shape of the RT&#8211;surprisal relationship, we fit generalized additive models (GAMs).<xref ref-type="fn" rid="n5">5</xref> For these models, we included only data that occurred before any mistakes in a sentence; due to limits of model vocabulary, words with punctuation and some uncommon words and proper nouns were excluded. We used surprisals generated by 4 different language models for robustness. (See methods for details on language models, exclusions, and model fit.)</p>
<p>The main effects of current and previous word surprisals on RT are shown in <xref ref-type="fig" rid="F4">Figure 4</xref>. Note that for each of the models, high-surprisal words are rare, with much of the data from words with between 0 and 15 bits of surprisal. All 4 models show a roughly linear relationship between current word surprisal and RT, especially in the region with more data. To determine the goodness of fit of a model in which word probability effects on RT are taken to be linear in surprisal, we also fit GAM models with both parametric linear and nonparametric non-linear terms for surprisal; for all but the 5-gram model, these analyses supported a linear effect of surprisal (Appendix D).</p>
<p>As a comparison, we also ran GAMs on the SPR data collected by Futrell et al. (<xref ref-type="bibr" rid="B23">2020</xref>). Previous work such as Smith &amp; Levy (<xref ref-type="bibr" rid="B64">2013</xref>) has found positive relationships between RT and the surprisal of earlier words for SPR, so we included predictors from the current and the 3 prior words. The relationship between surprisals and RT is shown in <xref ref-type="fig" rid="F5">Figure 5</xref>; note that the y-axis range is much narrower than for Maze. Both current and previous word surprisals have a roughly linear positive relationship to RT. The surprisal of the word two positions back also has an influence in some models.</p>
<p>Comparing Maze and SPR, we see that both show a linear relationship, but Maze has much larger effects of surprisal on the current word.</p>
</sec>
<sec>
<title>4.4 Does A-maze have less spillover?</title>
<p>One of the main claimed advantages of the Maze task is that it has better localization and less spillover than SPR. We examined how much spillover A-maze and SPR each had by fitting linear models with predictors from current and previous words. Large effects from previous words are evidence for spillover; effects of the current word dwarfing any lagged effects would be evidence for localization.</p>
<p>We modeled reading time as a function of surprisal, frequency, and length as well as surprisal &#215; length and frequency &#215; length interactions. For all of these, we included the predictors for the current and previous word, and we centered, but did not rescale, all predictors. (See methods for more details on these predictors and model fit process.) As with the GAM models, we used surprisal calculations from 4 different language models for robustness.</p>
<p>The Maze linear model effects are shown in <xref ref-type="fig" rid="F6">Figure 6</xref> (see also Appendix B for a table of effects). Across all models, there were consistent large effects of length and surprisal at the current word, but minimal effects of frequency. This lack of frequency effects differs from the results usually reported for SPR and eye-tracking (though see <xref ref-type="bibr" rid="B61">Shain, 2019</xref>). There was a small interaction between surprisal and length at the current word.</p>
<fig id="F6">
<caption>
<p><bold>Figure 6:</bold> Point estimates and 95% credible intervals for coefficients predicted by fitted Bayesian regression models predicting A-maze RT. Units are in ms. Surprisal is per bit, length per character, and frequency per <italic>log</italic><sub>2</sub> occurrence per billion words.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossapx-2-1-190-g6.png"/>
</fig>
<p>Crucially, the effects of previous word predictors are close to zero, and much smaller than the effects of surprisal and length of the current word, an indication that spillover is limited and effects are strongly localized.</p>
<p>We ran similar models for SPR, although to account for known spillover effects, we considered predictors from the current and 3 previous words. Due to issues fitting models, the details of the models differed (see methods). The SPR coefficients are shown in <xref ref-type="fig" rid="F7">Figure 7</xref> (see also Appendix B for a table of coefficients). Surprisal, length, and frequency effects are all evident for the current word, and surprisal and frequency show effects from the previous word as well. Unlike for Maze, with SPR there is not a clear diminishing of effect sizes as one goes from current-word to prior-word predictors.</p>
<fig id="F7">
<caption>
<p><bold>Figure 7:</bold> Point estimates and 95% confidence intervals (+/&#8211;1.97 standard error) for coefficients predicted by fitted regression models predicting SPR RT. Units are in ms. Surprisal is per bit, length per character, and frequency per <italic>log</italic><sub>2</sub> occurrence per billion words.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossapx-2-1-190-g7.png"/>
</fig>
<p>Whereas Maze showed surprisal effects in the 10 to 25 ms/bit range and length effects in the 15 to 20 ms/character range, SPR effects are about 1 to 2 ms per bit or character. This difference in effect size is disproportionate to the overall speed of the methods; the predicted intercept for the Maze task was roughly 880 ms and for SPR was roughly 360 ms. Thus Maze is 2&#8211;3 times as slow as SPR but has roughly 10 times larger effects.</p>
</sec>
<sec>
<title>4.5 Which language model fits best?</title>
<p>Our last analysis question is whether some of the language models fit the human RT data better than others. We assessed each model&#8217;s fit to the A-maze data using log likelihood and R-squared. We then performed nested model comparisons, testing whether a model with two surprisal predictors (e.g., GRNN and GPT-2) fits better than a model with only one (e.g., GRNN alone).</p>
<p>As shown in <xref ref-type="table" rid="T1">Table 1</xref>, GPT-2 provides substantial additional predictive value over each other model; GRNN provides substantial value over 5-gram and Transformer-XL, and a little complementary information over GPT-2; Transformer-XL provides substantial value over 5-gram; and 5-gram provides little over any model. The single-model log likelihoods confirm this hierarchy: GPT-2 is better than GRNN, which is better than Transformer-XL, which is better than 5-gram.</p>
<table-wrap id="T1">
<caption>
<p><bold>Table 1:</bold> Results of model comparisons on Maze data. Each row shows the additional predictive value gained from adding that model to another model. F values and p values from ANOVA tests between 1-surprisal-source and 2-source models are reported. We also report log likelihoods of models with only one surprisal source and the r-squared correlation between the model&#8217;s predictions and the data.</p>
</caption>
<table>
<thead>
<tr>
<td align="left" valign="top"><bold>Model</bold></td>
<td align="left" valign="top"><bold>over 5-gram</bold></td>
<td align="left" valign="top"><bold>over GRNN</bold></td>
<td align="left" valign="top"><bold>over TXL</bold></td>
<td align="left" valign="top"><bold>over GPT-2</bold></td>
<td align="left" valign="top"><bold>Log Lik</bold></td>
<td align="left" valign="top"><bold>r_squared</bold></td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">5-gram</td>
<td align="left" valign="top"></td>
<td align="left" valign="top">2 (p = 0.153)</td>
<td align="left" valign="top">3 (p = 0.035)</td>
<td align="left" valign="top">0 (p = 0.611)</td>
<td align="left" valign="top">&#8211;43817</td>
<td align="left" valign="top">0.16</td>
</tr>
<tr>
<td align="left" valign="top">GRNN</td>
<td align="left" valign="top">287 (p &lt; 0.001)</td>
<td align="left" valign="top"></td>
<td align="left" valign="top">113 (p &lt; 0.001)</td>
<td align="left" valign="top">13 (p &lt; 0.001)</td>
<td align="left" valign="top">&#8211;43544</td>
<td align="left" valign="top">0.23</td>
</tr>
<tr>
<td align="left" valign="top">TXL</td>
<td align="left" valign="top">174 (p &lt; 0.001)</td>
<td align="left" valign="top">5 (p = 0.006)</td>
<td align="left" valign="top"></td>
<td align="left" valign="top">2 (p = 0.137)</td>
<td align="left" valign="top">&#8211;43650</td>
<td align="left" valign="top">0.2</td>
</tr>
<tr>
<td align="left" valign="top">GPT-2</td>
<td align="left" valign="top">394 (p &lt; 0.001)</td>
<td align="left" valign="top">113 (p &lt; 0.001)</td>
<td align="left" valign="top">213 (p &lt; 0.001)</td>
<td align="left" valign="top"></td>
<td align="left" valign="top">&#8211;43445</td>
<td align="left" valign="top">0.25</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We followed the same process for the SPR data, with results shown in <xref ref-type="table" rid="T2">Table 2</xref>. For SPR, the GPT-2 and 5-gram models each add some predictive value over each other model; the picture is less clear for Transformer-XL and GRNN. In terms of log likelihoods, we find that GPT-2 is better than 5-gram, which is better than GRNN, which is better than Transformer-XL, although the differences are small. The relatively good fit of 5-gram models to SPR data compared with neural models matches results from Hu et al. (<xref ref-type="bibr" rid="B31">2020</xref>) and Wilcox et al. (<xref ref-type="bibr" rid="B74">2020</xref>), and contrasts with the Maze results, where the 5-gram model had the worst fit and did not provide additional predictive value over the other models. While the nature of the generalizations made by these neural network-based models is not fully understood, controlled tests have suggested that their next-word predictions often reflect deeper features of linguistic structure (<xref ref-type="bibr" rid="B31">Hu et al., 2020</xref>; <xref ref-type="bibr" rid="B71">Warstadt et al., 2020</xref>), such as subject&#8211;verb agreement (<xref ref-type="bibr" rid="B48">Marvin &amp; Linzen, 2018</xref>) and wh-dependencies (<xref ref-type="bibr" rid="B73">Wilcox et al., In press</xref>), and that they are sensitive to longer context windows than n-gram models. The fact that the neural language models dominate the 5-gram model for Maze but not SPR thus suggests that Maze RTs may be more sensitive than SPR RTs to richer language structure-related processes during real-time comprehension.</p>
<table-wrap id="T2">
<caption>
<p><bold>Table 2:</bold> Results of model comparisons on SPR data. Each row shows the additional predictive value gained from adding that model to another model. F values and p values from ANOVA tests between 1-surprisal-source and 2-source models are reported. We also report log likelihoods of models with only one surprisal source and the r-squared correlation between the model&#8217;s predictions and the data.</p>
</caption>
<table>
<thead>
<tr>
<td align="left" valign="top"><bold>Model</bold></td>
<td align="left" valign="top"><bold>over 5-gram</bold></td>
<td align="left" valign="top"><bold>over GRNN</bold></td>
<td align="left" valign="top"><bold>over TXL</bold></td>
<td align="left" valign="top"><bold>over GPT-2</bold></td>
<td align="left" valign="top"><bold>Log Lik</bold></td>
<td align="left" valign="top"><bold>r_squared</bold></td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">5-gram</td>
<td align="left" valign="top"></td>
<td align="left" valign="top">3 (p = 0.032)</td>
<td align="left" valign="top">4 (p = 0.001)</td>
<td align="left" valign="top">3 (p = 0.033)</td>
<td align="left" valign="top">&#8211;51798</td>
<td align="left" valign="top">0.007</td>
</tr>
<tr>
<td align="left" valign="top">GRNN</td>
<td align="left" valign="top">7 (p &lt; 0.001)</td>
<td align="left" valign="top"></td>
<td align="left" valign="top">6 (p &lt; 0.001)</td>
<td align="left" valign="top">2 (p = 0.153)</td>
<td align="left" valign="top">&#8211;51790</td>
<td align="left" valign="top">0.009</td>
</tr>
<tr>
<td align="left" valign="top">TXL</td>
<td align="left" valign="top">3 (p = 0.010)</td>
<td align="left" valign="top">0 (p = 0.910)</td>
<td align="left" valign="top"></td>
<td align="left" valign="top">1 (p = 0.462)</td>
<td align="left" valign="top">&#8211;51801</td>
<td align="left" valign="top">0.007</td>
</tr>
<tr>
<td align="left" valign="top">GPT-2</td>
<td align="left" valign="top">10 (p &lt; 0.001)</td>
<td align="left" valign="top">5 (p &lt; 0.001)</td>
<td align="left" valign="top">10 (p &lt; 0.001)</td>
<td align="left" valign="top"></td>
<td align="left" valign="top">&#8211;51783</td>
<td align="left" valign="top">0.011</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As an overall measure of fit to the data, we calculated multiple R-squared for the single-surprisal-source models for both A-maze and SPR. The models predict A-maze better than SPR, with R-squared values for A-maze ranging from 0.16 for the 5-gram model to 0.25 for GPT-2. For SPR, the R-squared values range from 0.007 to 0.011. This pattern suggests that the effect size differences are not due merely to the larger overall reading times for A-maze, but that A-maze is instead more sensitive to surprisal and length effects.</p>
</sec>
</sec>
<sec>
<title>5. Discussion</title>
<p>We introduced error-correction Maze, a tweak on the presentation of Maze materials that makes Maze feasible for multi-sentence passages. We then used A-maze distractors and the error-correction Maze presentation to gather data on participants reading stories from the Natural Stories corpus in the Maze. As laid out in the introduction, this current study addressed five main questions.</p>
<p>First, we found that participants could read and comprehend the 1000-word stories, despite the slowness and added overhead of reading in the Maze task. This result expands the domain of materials usable with Maze beyond targeted single-sentence items to longer, naturalistic texts with sentence-to-sentence coherency.</p>
<p>Second, we took advantage of the pre-existing SPR corpus on Natural Stories to compare the RT profiles between Maze and SPR. Maze and SPR pick up on similar features in words, as shown by the high correlations between Maze and SPR RTs on the sentence level. The correlation within Maze is higher than the Maze-to-SPR or SPR-to-SPR correlations, which is evidence that Maze is less noisy than SPR.</p>
<p>Third, we addressed whether the A-maze RT for a word showed a linear relationship with that word&#8217;s surprisal. We found that A-maze RTs are linearly related to surprisal, matching the functional profile found with other incremental processing methods.</p>
<p>Fourth, we compared the spillover profiles between Maze and SPR. For Maze, we found large effects of the current word&#8217;s surprisal and length, which dwarfed any spillover effects from previous-word predictors. In contrast, for SPR, we found effects of roughly equal sizes from the current and previous words.<xref ref-type="fn" rid="n6">6</xref> Overall, Maze is a slower task than SPR, but it also has much larger effects of length and surprisal, perhaps due to requiring more focus, and thus generating less noisy data. We do not find frequency effects on the Maze data, but we do on the SPR data. This could be explained if frequency effects are a first rough approximation of in-context predictability, before the fuller context-sensitive surprisal information is available. In this case, faster methods like eye-tracking and SPR would show frequency effects (in addition to surprisal), but slower methods like Maze would not, as the additional demands slow down the response, allowing more contextual information to be used. While this is a difference between Maze and other incremental processing methods, we do not consider it a flaw for Maze &#8211; indeed, for researchers interested in focusing on context-contingent language processing, it may suggest an advantage for the Maze task. Regardless, these differences highlight the importance of understanding the task demands of different incremental processing methods.</p>
<p>Lastly, we examined how different language models fare at predicting human RT data. We found that overall, the models were more predictive of the A-maze data than the SPR data; however, the ranking of the models&#8217; predictive performance also differed between the A-maze and SPR datasets. This difference suggests that how well a language model predicts human RTs may depend on the task. Maze RTs were by far best predicted by neural network language models, whereas SPR RTs were predicted nearly as well by 5-gram models. Our understanding of the linguistic generalization capabilities and performance of these neural network models is still limited, and there are cases where they are known to make more superficial, non-human-like generalizations (<xref ref-type="bibr" rid="B13">Chaves, 2020</xref>; <xref ref-type="bibr" rid="B49">McCoy et al., 2019</xref>), but controlled tests in the NLP literature that analyze their behavior on classic psycholinguistics paradigms (<xref ref-type="bibr" rid="B24">Futrell et al., 2019</xref>; <xref ref-type="bibr" rid="B44">Linzen et al., 2016</xref>; <xref ref-type="bibr" rid="B71">Warstadt et al., 2020</xref>; <xref ref-type="bibr" rid="B73">Wilcox et al., In press</xref>) suggest more human-like performance than n-gram models are capable of. These findings further add to the evidence that the Maze task is favorable for RT-based investigations of underlying linguistic processing in the human mind. More broadly, further comparisons between different processing methods on the same materials could be useful for a deeper understanding of how task demands influence language processing (e.g., <xref ref-type="bibr" rid="B5">Bartek et al., 2011</xref>).</p>
<p>Overall, A-maze has excellent localization, although some models showed small but statistically significant effects of the past word. On the whole, our results support the idea that Maze forces language processing to be close to word-by-word, and thus the Maze task can be used under the assumption that the RT of a word primarily reflects its own properties and not those of earlier words. Correlation analysis between Maze and SPR suggests that Maze is picking up on many of the same patterns as does SPR, but with less noise.</p>
<sec>
<title>5.1 Limitations</title>
<p>While we expect these patterns of results reflect features of the A-maze task, the effects could be moderated by quirks of the materials or the participant population. We excluded a large number of participants for having low accuracy on the task and appearing to guess randomly. We compared RTs collected on the A-maze task to SPR RTs previously collected on the same corpus, but we did not randomly assign participants to SPR and Maze conditions. This study suggests that A-maze is a localized and widely usable method, but only broader applications can confirm these findings.</p>
</sec>
<sec>
<title>5.2 Future directions</title>
<p>Compared to traditional Maze, in error-correction Maze, participants&#8217; incentives to finish quickly are in less conflict with the experimenter&#8217;s desire that participants do the task as intended. However, even with error-correction Maze, clicking randomly is still likely faster than doing the task. In discussing this work, we received the suggestion that one way to further disincentivize random clicking would be to add a pause when a participant makes a mistake, forcing them to wait some short period of time, such as 500 ms, before correcting their mistake. This delay would make randomly hitting buttons slower than doing the task as intended, and we have made delaying after wrong presses an option in the error-correction Maze implementation at <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/vboyce/Ibex-with-Maze">https://github.com/vboyce/Ibex-with-Maze</ext-link>.</p>
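<p>The released implementation is in JavaScript; purely to illustrate the logic of this delay option, the following hypothetical sketch (not the actual Ibex-with-Maze code; the 500 ms value and all names are illustrative) shows presses being ignored until a post-error lockout elapses:</p>

```python
ERROR_DELAY_MS = 500  # hypothetical post-error delay


class MazeTrial:
    """Sketch of the error-delay logic: after a wrong selection,
    further presses are ignored until the delay has elapsed."""

    def __init__(self, delay_ms=ERROR_DELAY_MS):
        self.delay_ms = delay_ms
        self.locked_until = None  # timestamp (ms) until which input is ignored

    def press(self, correct, now_ms):
        # Ignore presses made during the post-error lockout.
        if self.locked_until is not None and now_ms < self.locked_until:
            return "ignored"
        self.locked_until = None
        if correct:
            return "advance"
        # Wrong choice: start the lockout, so random clicking is slow.
        self.locked_until = now_ms + self.delay_ms
        return "error"


trial = MazeTrial()
assert trial.press(correct=False, now_ms=0) == "error"
assert trial.press(correct=True, now_ms=200) == "ignored"  # still locked out
assert trial.press(correct=True, now_ms=600) == "advance"  # delay elapsed
```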
<p>Error-correction Maze records RTs for words after a participant makes a mistake in the sentence. In our analyses, we excluded these post-error data, but we believe it is an open question whether data from after a participant makes a mistake is usable. That is, does it show the same profile as RTs from pre-error words, or are there traces from recovering from the mistake? If there are, how long do these effects take to fade? Whether post-mistake data is high-quality and trustworthy enough to be included in analyses is hard to assess; if it can be used, it would make the Maze task more data efficient.</p>
<p>The Maze task is versatile and can be used or adapted for a wide range of materials and questions of interest. Its forced incrementality makes the Maze task a good target for any question that requires precisely determining the locus of incremental processing difficulty. We encourage researchers to use Maze as an incremental processing method, alone or in comparison with other methods, and we suggest that the error-correction mode be the default choice for presenting Maze materials.</p>
</sec>
</sec>
</body>
<back>
<sec>
<title>Appendix A</title>
<p>The beginning of one of the stories. This excerpt is the first 200 words of a 1000-word story.</p>
<p>Tulip mania was a period in the Dutch Golden Age during which contract prices for bulbs of the recently introduced tulip reached extraordinarily high levels and then suddenly collapsed. At the peak of tulip mania in February sixteen thirty-seven, tulip contracts sold for more than ten times the annual income of a skilled craftsman. It is generally considered the first recorded economic bubble. The tulip, introduced to Europe in the mid sixteenth century from the Ottoman Empire, became very popular in the United Provinces, which we now know as the Netherlands. Tulip cultivation in the United Provinces is generally thought to have started in earnest around fifteen ninety-three, after the Flemish botanist Charles de l&#8217;Ecluse had taken up a post at the University of Leiden and established a botanical garden, which is famous as one of the oldest in the world. There, he planted his collection of tulip bulbs that the Emperor&#8217;s ambassador sent to him from Turkey, which were able to tolerate the harsher conditions of the northern climate. It was shortly thereafter that the tulips began to grow in popularity. The flower rapidly became a coveted luxury item and a status symbol, and a profusion of varieties followed.</p>
<disp-quote>
<p>The first two of the six comprehension questions.</p>
<p>When did tulip mania reach its peak? 1630&#8217;s, 1730&#8217;s</p>
<p>From which country did tulips come to Europe? Turkey, Egypt</p>
</disp-quote>
</sec>
<sec>
<title>Appendix B</title>
<p>Full numerical results from the fitted regression models are shown in <xref ref-type="table" rid="T3">Table 3</xref> for A-maze and in <xref ref-type="table" rid="T4">Table 4</xref> for SPR.</p>
<table-wrap id="T3">
<caption>
<p><bold>Table 3:</bold> Predictions from fitted Bayesian regression models. All terms were centered, but not rescaled. Units are in ms. Surprisal is per bit, length per character, and frequency per <italic>log</italic><sub>2</sub> occurrence per billion words. Interval is 2.5th quantile to 97.5th quantile of model draws.</p>
</caption>
<table>
<thead>
<tr>
<td align="left" valign="top"><bold>Term</bold></td>
<td align="left" valign="top"><bold>5-gram</bold></td>
<td align="left" valign="top"><bold>GRNN</bold></td>
<td align="left" valign="top"><bold>Transformer-XL</bold></td>
<td align="left" valign="top"><bold>GPT-2</bold></td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Intercept</td>
<td align="left" valign="top">876 [840.4, 910.9]</td>
<td align="left" valign="top">876.8 [840.1, 911.5]</td>
<td align="left" valign="top">880 [842.8, 914.9]</td>
<td align="left" valign="top">878.5 [845.6, 911.6]</td>
</tr>
<tr>
<td align="left" valign="top">Surprisal</td>
<td align="left" valign="top">11.1 [8.7, 13.6]</td>
<td align="left" valign="top">22.3 [19.7, 25]</td>
<td align="left" valign="top">17.8 [15.3, 20.2]</td>
<td align="left" valign="top">24.2 [21.5, 27]</td>
</tr>
<tr>
<td align="left" valign="top">Length</td>
<td align="left" valign="top">21.4 [16.6, 26.3]</td>
<td align="left" valign="top">17.9 [13.2, 22.7]</td>
<td align="left" valign="top">20.5 [15.6, 25.4]</td>
<td align="left" valign="top">16.2 [11.3, 21.2]</td>
</tr>
<tr>
<td align="left" valign="top">Frequency</td>
<td align="left" valign="top">&#8211;3.2 [&#8211;6.7, 0.5]</td>
<td align="left" valign="top">1.8 [&#8211;1.1, 4.7]</td>
<td align="left" valign="top">&#8211;0.1 [&#8211;3.2, 2.9]</td>
<td align="left" valign="top">&#8211;1.4 [&#8211;4.2, 1.2]</td>
</tr>
<tr>
<td align="left" valign="top">Surp &#215; Length</td>
<td align="left" valign="top">&#8211;2 [&#8211;3, &#8211;0.9]</td>
<td align="left" valign="top">&#8211;2.1 [&#8211;3, &#8211;1.2]</td>
<td align="left" valign="top">&#8211;1.4 [&#8211;2.1, &#8211;0.6]</td>
<td align="left" valign="top">&#8211;1.8 [&#8211;2.7, &#8211;1]</td>
</tr>
<tr>
<td align="left" valign="top">Freq &#215; Length</td>
<td align="left" valign="top">&#8211;1 [&#8211;2.5, 0.6]</td>
<td align="left" valign="top">&#8211;0.4 [&#8211;1.5, 0.7]</td>
<td align="left" valign="top">0.1 [&#8211;1, 1.1]</td>
<td align="left" valign="top">0.1 [&#8211;0.9, 1.1]</td>
</tr>
<tr>
<td align="left" valign="top">Past Surprisal</td>
<td align="left" valign="top">1.5 [&#8211;0.6, 3.5]</td>
<td align="left" valign="top">2.7 [1, 4.4]</td>
<td align="left" valign="top">0.9 [&#8211;0.7, 2.5]</td>
<td align="left" valign="top">3.5 [1.8, 5.3]</td>
</tr>
<tr>
<td align="left" valign="top">Past Length</td>
<td align="left" valign="top">&#8211;3.5 [&#8211;7.8, 0.7]</td>
<td align="left" valign="top">&#8211;4.8 [&#8211;9, &#8211;0.8]</td>
<td align="left" valign="top">&#8211;3.7 [&#8211;7.7, 0.3]</td>
<td align="left" valign="top">&#8211;5.1 [&#8211;9.2, &#8211;1.1]</td>
</tr>
<tr>
<td align="left" valign="top">Past Freq</td>
<td align="left" valign="top">2.5 [&#8211;0.3, 5.4]</td>
<td align="left" valign="top">1.8 [&#8211;0.4, 4]</td>
<td align="left" valign="top">1 [&#8211;1.3, 3.3]</td>
<td align="left" valign="top">0.7 [&#8211;1.4, 2.8]</td>
</tr>
<tr>
<td align="left" valign="top">Past Surp &#215; Length</td>
<td align="left" valign="top">&#8211;0.2 [&#8211;1.1, 0.8]</td>
<td align="left" valign="top">&#8211;0.9 [&#8211;1.7, &#8211;0.2]</td>
<td align="left" valign="top">&#8211;0.5 [&#8211;1.2, 0.2]</td>
<td align="left" valign="top">&#8211;1.1 [&#8211;1.8, &#8211;0.4]</td>
</tr>
<tr>
<td align="left" valign="top">Past Freq &#215; Length</td>
<td align="left" valign="top">&#8211;1 [&#8211;2.4, 0.4]</td>
<td align="left" valign="top">&#8211;1.8 [&#8211;2.8, &#8211;0.8]</td>
<td align="left" valign="top">&#8211;1.5 [&#8211;2.5, &#8211;0.4]</td>
<td align="left" valign="top">&#8211;1.7 [&#8211;2.7, &#8211;0.8]</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T4">
<caption>
<p><bold>Table 4:</bold> Predictions from fitted regression models for SPR data. All terms were centered, but not rescaled. Units are in ms. Surprisal is per bit, length per character, and frequency per <italic>log</italic><sub>2</sub> occurrence per billion words. Uncertainty interval is &#177;1.97 standard errors.</p>
</caption>
<table>
<thead>
<tr>
<td align="left" valign="top"><bold>Term</bold></td>
<td align="left" valign="top"><bold>5-gram</bold></td>
<td align="left" valign="top"><bold>GRNN</bold></td>
<td align="left" valign="top"><bold>Transformer-XL</bold></td>
<td align="left" valign="top"><bold>GPT-2</bold></td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Intercept</td>
<td align="left" valign="top">361.6 [344.5, 378.6]</td>
<td align="left" valign="top">363.8 [346.8, 380.8]</td>
<td align="left" valign="top">363.9 [346.9, 380.9]</td>
<td align="left" valign="top">363.9 [346.9, 380.9]</td>
</tr>
<tr>
<td align="left" valign="top">Surprisal</td>
<td align="left" valign="top">1.1 [0.2, 2.1]</td>
<td align="left" valign="top">1.8 [1, 2.7]</td>
<td align="left" valign="top">1.1 [0.3, 1.9]</td>
<td align="left" valign="top">1.1 [0.3, 1.9]</td>
</tr>
<tr>
<td align="left" valign="top">Length</td>
<td align="left" valign="top">2.1 [0, 4.2]</td>
<td align="left" valign="top">2 [&#8211;0.1, 4]</td>
<td align="left" valign="top">2.2 [0.1, 4.2]</td>
<td align="left" valign="top">2.2 [0.1, 4.2]</td>
</tr>
<tr>
<td align="left" valign="top">Frequency</td>
<td align="left" valign="top">1.4 [0, 2.8]</td>
<td align="left" valign="top">1.6 [0.5, 2.8]</td>
<td align="left" valign="top">1.2 [0.1, 2.4]</td>
<td align="left" valign="top">1.2 [0.1, 2.4]</td>
</tr>
<tr>
<td align="left" valign="top">Surp &#215; Length</td>
<td align="left" valign="top">&#8211;0.2 [&#8211;0.6, 0.3]</td>
<td align="left" valign="top">&#8211;0.2 [&#8211;0.6, 0.1]</td>
<td align="left" valign="top">0.1 [&#8211;0.3, 0.4]</td>
<td align="left" valign="top">0.1 [&#8211;0.3, 0.4]</td>
</tr>
<tr>
<td align="left" valign="top">Freq &#215; Length</td>
<td align="left" valign="top">&#8211;0.4 [&#8211;1, 0.2]</td>
<td align="left" valign="top">&#8211;0.4 [&#8211;0.9, 0.1]</td>
<td align="left" valign="top">&#8211;0.2 [&#8211;0.7, 0.3]</td>
<td align="left" valign="top">&#8211;0.2 [&#8211;0.7, 0.3]</td>
</tr>
<tr>
<td align="left" valign="top">Past Surprisal</td>
<td align="left" valign="top">1 [0.1, 1.9]</td>
<td align="left" valign="top">0.9 [0.1, 1.7]</td>
<td align="left" valign="top">0.7 [0, 1.5]</td>
<td align="left" valign="top">0.7 [0, 1.5]</td>
</tr>
<tr>
<td align="left" valign="top">Past Length</td>
<td align="left" valign="top">0.1 [&#8211;2, 2.1]</td>
<td align="left" valign="top">&#8211;0.1 [&#8211;2.1, 1.9]</td>
<td align="left" valign="top">0.1 [&#8211;1.9, 2.1]</td>
<td align="left" valign="top">0.1 [&#8211;1.9, 2.1]</td>
</tr>
<tr>
<td align="left" valign="top">Past Freq</td>
<td align="left" valign="top">1.5 [0.2, 2.9]</td>
<td align="left" valign="top">1.1 [0, 2.2]</td>
<td align="left" valign="top">1.1 [0, 2.2]</td>
<td align="left" valign="top">1.1 [0, 2.2]</td>
</tr>
<tr>
<td align="left" valign="top">Past Surp &#215; Length</td>
<td align="left" valign="top">&#8211;0.1 [&#8211;0.5, 0.3]</td>
<td align="left" valign="top">0 [&#8211;0.4, 0.3]</td>
<td align="left" valign="top">0.2 [&#8211;0.2, 0.5]</td>
<td align="left" valign="top">0.2 [&#8211;0.2, 0.5]</td>
</tr>
<tr>
<td align="left" valign="top">Past Freq &#215; Length</td>
<td align="left" valign="top">&#8211;0.2 [&#8211;0.8, 0.5]</td>
<td align="left" valign="top">&#8211;0.1 [&#8211;0.6, 0.4]</td>
<td align="left" valign="top">0.1 [&#8211;0.4, 0.6]</td>
<td align="left" valign="top">0.1 [&#8211;0.4, 0.6]</td>
</tr>
<tr>
<td align="left" valign="top">2Past Surprisal</td>
<td align="left" valign="top">0.6 [&#8211;0.4, 1.5]</td>
<td align="left" valign="top">0 [&#8211;0.8, 0.8]</td>
<td align="left" valign="top">&#8211;0.2 [&#8211;1, 0.6]</td>
<td align="left" valign="top">&#8211;0.2 [&#8211;1, 0.6]</td>
</tr>
<tr>
<td align="left" valign="top">2Past Length</td>
<td align="left" valign="top">2.2 [0.3, 4.2]</td>
<td align="left" valign="top">2.1 [0.2, 4]</td>
<td align="left" valign="top">2.1 [0.2, 4]</td>
<td align="left" valign="top">2.1 [0.2, 4]</td>
</tr>
<tr>
<td align="left" valign="top">2Past Freq</td>
<td align="left" valign="top">1.5 [0.2, 2.8]</td>
<td align="left" valign="top">0.8 [&#8211;0.3, 1.9]</td>
<td align="left" valign="top">0.7 [&#8211;0.5, 1.8]</td>
<td align="left" valign="top">0.7 [&#8211;0.5, 1.8]</td>
</tr>
<tr>
<td align="left" valign="top">2Past Surp &#215; Length</td>
<td align="left" valign="top">&#8211;0.3 [&#8211;0.7, 0.2]</td>
<td align="left" valign="top">&#8211;0.3 [&#8211;0.6, 0.1]</td>
<td align="left" valign="top">0 [&#8211;0.3, 0.4]</td>
<td align="left" valign="top">0 [&#8211;0.3, 0.4]</td>
</tr>
<tr>
<td align="left" valign="top">2Past Freq &#215; Length</td>
<td align="left" valign="top">&#8211;0.3 [&#8211;1, 0.3]</td>
<td align="left" valign="top">&#8211;0.3 [&#8211;0.7, 0.2]</td>
<td align="left" valign="top">0 [&#8211;0.5, 0.4]</td>
<td align="left" valign="top">0 [&#8211;0.5, 0.4]</td>
</tr>
<tr>
<td align="left" valign="top">3Past Surprisal</td>
<td align="left" valign="top">&#8211;0.3 [&#8211;1.3, 0.6]</td>
<td align="left" valign="top">&#8211;1 [&#8211;1.8, &#8211;0.2]</td>
<td align="left" valign="top">&#8211;0.9 [&#8211;1.7, &#8211;0.2]</td>
<td align="left" valign="top">&#8211;0.9 [&#8211;1.7, &#8211;0.2]</td>
</tr>
<tr>
<td align="left" valign="top">3Past Length</td>
<td align="left" valign="top">1.1 [&#8211;0.9, 3]</td>
<td align="left" valign="top">1.1 [&#8211;0.9, 3]</td>
<td align="left" valign="top">0.8 [&#8211;1.1, 2.7]</td>
<td align="left" valign="top">0.8 [&#8211;1.1, 2.7]</td>
</tr>
<tr>
<td align="left" valign="top">3Past Freq</td>
<td align="left" valign="top">0.4 [&#8211;1, 1.7]</td>
<td align="left" valign="top">0 [&#8211;1.1, 1.1]</td>
<td align="left" valign="top">0 [&#8211;1.2, 1.1]</td>
<td align="left" valign="top">0 [&#8211;1.2, 1.1]</td>
</tr>
<tr>
<td align="left" valign="top">3Past Surp &#215; Length</td>
<td align="left" valign="top">&#8211;0.5 [&#8211;0.9, 0]</td>
<td align="left" valign="top">&#8211;0.3 [&#8211;0.6, 0.1]</td>
<td align="left" valign="top">&#8211;0.1 [&#8211;0.4, 0.3]</td>
<td align="left" valign="top">&#8211;0.1 [&#8211;0.4, 0.3]</td>
</tr>
<tr>
<td align="left" valign="top">3Past Freq &#215; Length</td>
<td align="left" valign="top">&#8211;0.4 [&#8211;1.1, 0.2]</td>
<td align="left" valign="top">&#8211;0.2 [&#8211;0.7, 0.3]</td>
<td align="left" valign="top">&#8211;0.1 [&#8211;0.6, 0.4]</td>
<td align="left" valign="top">&#8211;0.1 [&#8211;0.6, 0.4]</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>Appendix C</title>
<p>We use <monospace>mgcv</monospace>&#8217;s <monospace>ti()</monospace> tensor interaction terms to test all main effects and two-way interactions among frequency, length, and surprisal for the current word and for the previous word. These effects are visualized in <xref ref-type="fig" rid="F8">Figure 8</xref> and <monospace>mgcv</monospace>&#8217;s approximate significance levels are given in <xref ref-type="table" rid="T5">Table 5</xref>. Based on these approximate significance levels, the main effects of current- and previous-word surprisal and length are significant, as are the current-word frequency-by-length and frequency-by-surprisal interactions; other terms are not statistically significant. These significant interactions can be summarized as especially long, infrequent words being especially slow to select; especially frequent and surprising words being especially slow to select; and especially infrequent and surprising words being less slow to select than a main-effects-only model would predict. The data driving these interactions are in the sparse tails of the word length and surprisal distributions, and as the <italic>F</italic> statistics in <xref ref-type="table" rid="T5">Table 5</xref> show, their variance explained is small relative to the large effect of current-word surprisal, so in the main-text analysis we set these interactions aside.</p>
<fig id="F8">
<caption>
<p><bold>Figure 8:</bold> Generalized Additive Model main effects and two-way interactions among frequency, length, and surprisal for A-Maze reading of the Natural Stories corpus. Confidence bands do not take into account the uncertainty associated with <monospace>mgcv</monospace> hyperparameter estimation (<xref ref-type="bibr" rid="B81">Wood, 2017</xref>).</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossapx-2-1-190-g8.png"/>
</fig>
<table-wrap id="T5">
<caption>
<p><bold>Table 5:</bold> Significance of Generalized Additive Model main effects and two-way interactions among frequency, length, and surprisal for A-Maze reading of the Natural Stories corpus.</p>
</caption>
<table>
<thead>
<tr>
<td align="left" valign="top"><bold>Term</bold></td>
<td align="left" valign="top"><bold>F-statistic</bold></td>
<td align="left" valign="top"><bold>p value</bold></td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">ti (surprisal)</td>
<td align="left" valign="top">95.6500</td>
<td align="left" valign="top">p &lt; 0.0001</td>
</tr>
<tr>
<td align="left" valign="top">ti (freq)</td>
<td align="left" valign="top">1.5420</td>
<td align="left" valign="top">p = 0.2267</td>
</tr>
<tr>
<td align="left" valign="top">ti (len)</td>
<td align="left" valign="top">8.2840</td>
<td align="left" valign="top">p = 0.0005</td>
</tr>
<tr>
<td align="left" valign="top">ti (freq,len)</td>
<td align="left" valign="top">4.6700</td>
<td align="left" valign="top">p &lt; 0.0001</td>
</tr>
<tr>
<td align="left" valign="top">ti (surprisal,len)</td>
<td align="left" valign="top">1.0300</td>
<td align="left" valign="top">p = 0.2418</td>
</tr>
<tr>
<td align="left" valign="top">ti (surprisal,freq)</td>
<td align="left" valign="top">14.9500</td>
<td align="left" valign="top">p &lt; 0.0001</td>
</tr>
<tr>
<td align="left" valign="top">ti (prev_surp)</td>
<td align="left" valign="top">6.6160</td>
<td align="left" valign="top">p &lt; 0.0001</td>
</tr>
<tr>
<td align="left" valign="top">ti (prev_freq)</td>
<td align="left" valign="top">2.1670</td>
<td align="left" valign="top">p = 0.0797</td>
</tr>
<tr>
<td align="left" valign="top">ti (prev_len)</td>
<td align="left" valign="top">3.0360</td>
<td align="left" valign="top">p = 0.0291</td>
</tr>
<tr>
<td align="left" valign="top">ti (prev_freq,prev_len)</td>
<td align="left" valign="top">0.4666</td>
<td align="left" valign="top">p = 0.6971</td>
</tr>
<tr>
<td align="left" valign="top">ti (prev_surp,prev_len)</td>
<td align="left" valign="top">2.5120</td>
<td align="left" valign="top">p = 0.1240</td>
</tr>
<tr>
<td align="left" valign="top">ti (prev_surp,prev_freq)</td>
<td align="left" valign="top">2.6470</td>
<td align="left" valign="top">p = 0.1014</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>Appendix D</title>
<p>The <monospace>mgcv</monospace> package&#8217;s implementation of Generalized Additive Models (<xref ref-type="bibr" rid="B81">Wood, 2017</xref>) allows linear and nonparametric spline effects of the same continuous predictor to be entered simultaneously into a model. Doing so associates only the nonlinear part of the effect with the spline term, allowing approximate statistical tests of the linear and non-linear components of the effect separately. We thus test whether the effect of surprisal on A-Maze RTs is best described as linear or includes a non-linear component, using the <monospace>mgcv</monospace> formula:</p>
<disp-quote>
<p><monospace>rt &#126; surprisal + s(surprisal, bs="cr", k=20) + ti(freq, bs="cr") + ti(len, bs="cr") + prev_surp + s(prev_surp, bs="cr", k=20) + ti(prev_freq, bs="cr") + ti(prev_len, bs="cr")</monospace></p>
</disp-quote>
<p>The results are in <xref ref-type="table" rid="T6">Table 6</xref>. For all but the 5-gram surprisal estimate, there is overwhelming evidence for a linear contribution of current-word surprisal, but little to no evidence for a non-linear contribution. For the 5-gram estimate, there is overwhelming evidence for the linear term, and some evidence for a nonlinearity as well. Consulting <xref ref-type="fig" rid="F4">Figure 4</xref> shows that this nonlinearity takes the form of the surprisal effect dwindling to zero in the sparse tail of high-surprisal words. This nonlinearity is plausibly due to the measurement error (high variance) of using counts to estimate very low multinomial probabilities. Taken together with the 5-gram model&#8217;s inferior overall fit, we conclude from this analysis that the evidence is quite strong that, like self-paced reading and eye tracking, A-Maze reading of naturalistic texts exhibits a linear effect of surprisal on RTs.</p>
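<p>As a toy analogue of this logic (a simple polynomial contrast, not <monospace>mgcv</monospace>&#8217;s spline decomposition; the data below are synthetic, generated purely for illustration), one can verify that when the true effect is linear, adding a curvature term to a linear regression yields essentially no gain in variance explained:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with a purely linear surprisal effect on RT.
surprisal = rng.uniform(0, 20, 500)
rt = 800 + 20 * surprisal + rng.normal(0, 50, 500)

def r_squared(y, X):
    # Ordinary least squares fit; proportion of variance explained.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones_like(surprisal)
linear = np.column_stack([ones, surprisal])
with_curvature = np.column_stack([ones, surprisal, surprisal ** 2])
# With a truly linear effect, the curvature term adds essentially nothing.
print(round(r_squared(rt, linear), 3), round(r_squared(rt, with_curvature), 3))
```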
<table-wrap id="T6">
<caption>
<p><bold>Table 6:</bold> Comparison of significance for the linear and spline surprisal terms from a GAM. We fit GAM models predicting reading time with current- and past-word surprisal as parametric (linear) terms, current- and past-word surprisal as spline terms, and current- and past-word frequency and length as tensor terms. Here we show the estimated <italic>p</italic>-values for the linear and spline surprisal terms at the current and past word. The spline terms account for any non-linear surprisal effects.</p>
</caption>
<table>
<thead>
<tr>
<td align="left" valign="top"><bold>Term</bold></td>
<td align="left" valign="top"><bold>5-gram</bold></td>
<td align="left" valign="top"><bold>GRNN</bold></td>
<td align="left" valign="top"><bold>Transformer-XL</bold></td>
<td align="left" valign="top"><bold>GPT-2</bold></td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Spline Surprisal</td>
<td align="left" valign="top">p = 0.0015</td>
<td align="left" valign="top">p = 0.7013</td>
<td align="left" valign="top">p = 0.7107</td>
<td align="left" valign="top">p = 0.0529</td>
</tr>
<tr>
<td align="left" valign="top">Spline Past Surprisal</td>
<td align="left" valign="top">p = 0.9861</td>
<td align="left" valign="top">p = 0.8792</td>
<td align="left" valign="top">p = 0.9835</td>
<td align="left" valign="top">p = 0.3778</td>
</tr>
<tr>
<td align="left" valign="top">Linear Surprisal</td>
<td align="left" valign="top">p &lt; 0.0001</td>
<td align="left" valign="top">p &lt; 0.0001</td>
<td align="left" valign="top">p &lt; 0.0001</td>
<td align="left" valign="top">p &lt; 0.0001</td>
</tr>
<tr>
<td align="left" valign="top">Linear Past Surprisal</td>
<td align="left" valign="top">p = 0.8607</td>
<td align="left" valign="top">p = 0.0540</td>
<td align="left" valign="top">p = 0.7558</td>
<td align="left" valign="top">p = 0.0002</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<fn-group>
<fn id="n1"><p>Of course, not all effects can necessarily be reduced to word predictability effects, and effects that <italic>cannot</italic> be reduced to word predictability may be of particular theoretical interest. Candidates include, for example, memory-based effects (<xref ref-type="bibr" rid="B41">Levy et al., 2013</xref>; <xref ref-type="bibr" rid="B42">Lewis et al., 2006</xref>), noisy-channel error identification (<xref ref-type="bibr" rid="B40">Levy et al., 2009</xref>), and the magnitude of processing difficulty in garden-path resolution (<xref ref-type="bibr" rid="B69">Van Schijndel &amp; Linzen, 2021</xref>; <xref ref-type="bibr" rid="B75">Wilcox et al., 2021</xref>).</p></fn>
<fn id="n2"><p>We, furthermore, used the R-packages <italic>bookdown</italic> (Version 0.29; <xref ref-type="bibr" rid="B83">Xie, 2016</xref>), <italic>brms</italic> (Version 2.18.0; <xref ref-type="bibr" rid="B9">B&#252;rkner, 2017</xref>, <xref ref-type="bibr" rid="B10">2018</xref>, <xref ref-type="bibr" rid="B11">2021</xref>), <italic>broom.mixed</italic> (Version 0.2.9.4; <xref ref-type="bibr" rid="B7">Bolker &amp; Robinson, 2022</xref>), <italic>cowplot</italic> (Version 1.1.1; <xref ref-type="bibr" rid="B76">Wilke, 2020</xref>), <italic>gridExtra</italic> (Version 2.3; <xref ref-type="bibr" rid="B2">Auguie, 2017</xref>), <italic>here</italic> (Version 1.0.1; <xref ref-type="bibr" rid="B52">M&#252;ller, 2020</xref>), <italic>kableExtra</italic> (Version 1.3.4; <xref ref-type="bibr" rid="B84">Zhu, 2021</xref>), <italic>lme4</italic> (Version 1.1.31; <xref ref-type="bibr" rid="B6">Bates et al., 2015</xref>), <italic>mgcv</italic> (<xref ref-type="bibr" rid="B78">Wood, 2003</xref>, <xref ref-type="bibr" rid="B79">2004</xref>; Version 1.8.41; <xref ref-type="bibr" rid="B80">Wood, 2011</xref>; <xref ref-type="bibr" rid="B82">Wood et al., 2016</xref>), <italic>mgcViz</italic> (Version 0.1.9; <xref ref-type="bibr" rid="B19">Fasiolo et al., 2018</xref>), <italic>papaja</italic> (Version 0.1.1; <xref ref-type="bibr" rid="B3">Aust &amp; Barth, 2022</xref>), <italic>patchwork</italic> (Version 1.1.2; <xref ref-type="bibr" rid="B55">Pedersen, 2022</xref>), <italic>rticles</italic> (Version 0.24.4; <xref ref-type="bibr" rid="B1">Allaire et al., 2022</xref>), <italic>tidybayes</italic> (Version 3.0.2; <xref ref-type="bibr" rid="B32">Kay, 2022</xref>), <italic>tidymv</italic> (Version 3.3.2; <xref ref-type="bibr" rid="B15">Coretta, 2022</xref>), and <italic>tidyverse</italic> (Version 1.3.2; <xref ref-type="bibr" rid="B72">Wickham et al., 2019</xref>).</p></fn>
<fn id="n3"><p>Surprisals should be additive, but summing the surprisals for multi-token words gave some unreasonable values. For instance, in one story the word &#8220;king!&#8217;&#8221; has a surprisal of 64 under GRNN (context: The other birds gave out one by one and when the eagle saw this he thought, &#8216;What is the use of flying any higher? This victory is in the bag and I am king!&#8217;). While GPT-2 uses byte-pair encoding that can split words into multiple parts, excluding the words it split up removed only 30 words that were not already excluded by the other models.</p></fn>
<fn id="n4"><p>To avoid biasing the average if a participant took a pause before returning to the task, RTs greater than 5 seconds were excluded. This exclusion removed 260 words, or 0.27% of trials.</p></fn>
<fn id="n5"><p>Due to previous reports of a length&#8211;frequency interaction in RT measures (<xref ref-type="bibr" rid="B34">Kliegl et al., 2006</xref>), before pursuing our primary question of the functional form of the surprisal&#8211;RT relationship, as an exploratory measure we fit generalized additive models (GAMs) with not only the main effects but also the two-way interactions between surprisal, length, and frequency, for the current word and for the previous word. This analysis revealed significant effects of current-word and previous-word surprisal, current-word and previous-word length, and significant interactions of current-word frequency by length and frequency by surprisal. The other main effects and interactions did not reach statistical significance. (These are results from <monospace>mgcv</monospace>&#8217;s <monospace>summary()</monospace>; the <italic>p</italic>-values are approximate.) Appendix C provides tables and plots of these effects and interactions for GPT-2. The interactions can be summarized as follows: long, low-frequency words and surprising, high-frequency words have especially long RTs, while surprising, low-frequency words have shorter RTs than would otherwise be predicted. However, these effects are small in terms of variance explained compared to the current-word surprisal effect, which is by far the largest single effect in the model. For simplicity we therefore set aside the interaction terms involving surprisal for the remainder of this analysis.</p></fn>
<fn id="n6"><p>Furthermore, the typical spillover profile for SPR data may be worse than suggested by the Natural Stories corpus SPR data: for example, Smith &amp; Levy (<xref ref-type="bibr" rid="B64">2013</xref>) found that most of a word&#8217;s surprisal effect showed up only one to two words downstream.</p></fn>
</fn-group>
<sec>
<title>Data accessibility</title>
<p>Data and materials are available at <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/vboyce/natural-stories-maze">https://github.com/vboyce/natural-stories-maze</ext-link>. DOI: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://doi.org/10.5281/zenodo.7783046">10.5281/zenodo.7783046</ext-link>.</p>
</sec>
<sec>
<title>Ethics and consent</title>
<p>This research was approved by MIT&#8217;s Committee on the Use of Humans as Experimental Subjects and run under protocol number 1605559077.</p>
</sec>
<sec>
<title>Funding information</title>
<p>RPL acknowledges support from NSF grant BCS-2121074, NIH grant U01-NS121471, and the MIT&#8211;IBM Artificial Intelligence Research Lab.</p>
</sec>
<ack>
<title>Acknowledgements</title>
<p>We thank the AMLAP 2020 audience, the Computational Psycholinguistics Lab at MIT, the Language and Cognition Lab at Stanford, the QuantLang Lab at UC Irvine, and Mike Frank for feedback on this work.</p>
</ack>
<sec>
<title>Competing interests</title>
<p>The authors have no competing interests to declare.</p>
</sec>
<sec>
<title>Author contributions</title>
<p>VB contributed Conceptualization, Formal Analysis, Investigation, Methodology, Software, and Writing &#8211; Original Draft Preparation. RPL contributed Conceptualization, Formal Analysis, Funding Acquisition, Methodology, Supervision, and Writing &#8211; Review &amp; Editing.</p>
</sec>
<ref-list>
<ref id="B1"><label>1</label><mixed-citation publication-type="webpage"><string-name><surname>Allaire</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Xie</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Dervieux</surname>, <given-names>C.</given-names></string-name>, <collab>R Foundation</collab>, <string-name><surname>Wickham</surname>, <given-names>H.</given-names></string-name>, <collab>Journal of Statistical Software</collab>, <string-name><surname>Vaidyanathan</surname>, <given-names>R.</given-names></string-name>, <collab>Association for Computing Machinery</collab>, <string-name><surname>Boettiger</surname>, <given-names>C.</given-names></string-name>, <collab>Elsevier</collab>, <string-name><surname>Broman</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Mueller</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Quast</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Pruim</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Marwick</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Wickham</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Keyes</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Emaasit</surname>, <given-names>D.</given-names></string-name>, &#8230; <string-name><surname>Hyndman</surname>, <given-names>R.</given-names></string-name> (<year>2022</year>). <source>Rticles: Article formats for R Markdown</source>. <uri>https://github.com/rstudio/rticles</uri></mixed-citation></ref>
<ref id="B2"><label>2</label><mixed-citation publication-type="webpage"><string-name><surname>Auguie</surname>, <given-names>B.</given-names></string-name> (<year>2017</year>). <source>gridExtra: Miscellaneous functions for &#8220;grid&#8221; graphics</source>. <uri>https://CRAN.R-project.org/package=gridExtra</uri></mixed-citation></ref>
<ref id="B3"><label>3</label><mixed-citation publication-type="webpage"><string-name><surname>Aust</surname>, <given-names>F.</given-names></string-name>, &amp; <string-name><surname>Barth</surname>, <given-names>M.</given-names></string-name> (<year>2022</year>). <source>papaja: Prepare reproducible APA journal articles with R Markdown</source>. <uri>https://github.com/crsh/papaja</uri></mixed-citation></ref>
<ref id="B4"><label>4</label><mixed-citation publication-type="journal"><string-name><surname>Barr</surname>, <given-names>D. J.</given-names></string-name>, <string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Scheepers</surname>, <given-names>C.</given-names></string-name>, &amp; <string-name><surname>Tily</surname>, <given-names>H. J.</given-names></string-name> (<year>2013</year>). <article-title>Random effects structure for confirmatory hypothesis testing: Keep it maximal</article-title>. <source>Journal of Memory and Language</source>, <volume>68</volume>(<issue>3</issue>), <fpage>255</fpage>&#8211;<lpage>278</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.jml.2012.11.001</pub-id></mixed-citation></ref>
<ref id="B5"><label>5</label><mixed-citation publication-type="journal"><string-name><surname>Bartek</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Lewis</surname>, <given-names>R. L.</given-names></string-name>, <string-name><surname>Vasishth</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name><surname>Smith</surname>, <given-names>M. R.</given-names></string-name> (<year>2011</year>). <article-title>In search of on-line locality effects in sentence comprehension</article-title>. <source>Journal of Experimental Psychology: Human Perception &amp; Performance</source>, <volume>37</volume>(<issue>5</issue>), <fpage>1178</fpage>. DOI: <pub-id pub-id-type="doi">10.1037/a0024194</pub-id></mixed-citation></ref>
<ref id="B6"><label>6</label><mixed-citation publication-type="journal"><string-name><surname>Bates</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>M&#228;chler</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Bolker</surname>, <given-names>B.</given-names></string-name>, &amp; <string-name><surname>Walker</surname>, <given-names>S.</given-names></string-name> (<year>2015</year>). <article-title>Fitting linear mixed-effects models using lme4</article-title>. <source>Journal of Statistical Software</source>, <volume>67</volume>(<issue>1</issue>), <fpage>1</fpage>&#8211;<lpage>48</lpage>. DOI: <pub-id pub-id-type="doi">10.18637/jss.v067.i01</pub-id></mixed-citation></ref>
<ref id="B7"><label>7</label><mixed-citation publication-type="webpage"><string-name><surname>Bolker</surname>, <given-names>B.</given-names></string-name>, &amp; <string-name><surname>Robinson</surname>, <given-names>D.</given-names></string-name> (<year>2022</year>). <source>Broom.mixed: Tidying methods for mixed models</source>. <uri>https://CRAN.R-project.org/package=broom.mixed</uri></mixed-citation></ref>
<ref id="B8"><label>8</label><mixed-citation publication-type="journal"><string-name><surname>Boyce</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Futrell</surname>, <given-names>R.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R. P.</given-names></string-name> (<year>2020</year>). <article-title>Maze made easy: Better and easier measurement of incremental processing difficulty</article-title>. <source>Journal of Memory and Language</source>, <volume>111</volume>, <fpage>104082</fpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.jml.2019.104082</pub-id></mixed-citation></ref>
<ref id="B9"><label>9</label><mixed-citation publication-type="journal"><string-name><surname>B&#252;rkner</surname>, <given-names>P.-C.</given-names></string-name> (<year>2017</year>). <article-title>brms: An R package for Bayesian multilevel models using Stan</article-title>. <source>Journal of Statistical Software</source>, <volume>80</volume>(<issue>1</issue>), <fpage>1</fpage>&#8211;<lpage>28</lpage>. DOI: <pub-id pub-id-type="doi">10.18637/jss.v080.i01</pub-id></mixed-citation></ref>
<ref id="B10"><label>10</label><mixed-citation publication-type="journal"><string-name><surname>B&#252;rkner</surname>, <given-names>P.-C.</given-names></string-name> (<year>2018</year>). <article-title>Advanced Bayesian multilevel modeling with the R package brms</article-title>. <source>The R Journal</source>, <volume>10</volume>(<issue>1</issue>), <fpage>395</fpage>&#8211;<lpage>411</lpage>. DOI: <pub-id pub-id-type="doi">10.32614/RJ-2018-017</pub-id></mixed-citation></ref>
<ref id="B11"><label>11</label><mixed-citation publication-type="journal"><string-name><surname>B&#252;rkner</surname>, <given-names>P.-C.</given-names></string-name> (<year>2021</year>). <article-title>Bayesian item response modeling in R with brms and Stan</article-title>. <source>Journal of Statistical Software</source>, <volume>100</volume>(<issue>5</issue>), <fpage>1</fpage>&#8211;<lpage>54</lpage>. DOI: <pub-id pub-id-type="doi">10.18637/jss.v100.i05</pub-id></mixed-citation></ref>
<ref id="B12"><label>12</label><mixed-citation publication-type="journal"><string-name><surname>Chac&#243;n</surname>, <given-names>D. A.</given-names></string-name>, <string-name><surname>Kort</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>O&#8217;Neill</surname>, <given-names>P.</given-names></string-name>, &amp; <string-name><surname>Sorensen</surname>, <given-names>T.</given-names></string-name> (<year>2021</year>). <source>Limits on semantic prediction in the processing of extraction from adjunct clauses</source>. DOI: <pub-id pub-id-type="doi">10.31234/osf.io/9rfmw</pub-id></mixed-citation></ref>
<ref id="B13"><label>13</label><mixed-citation publication-type="webpage"><string-name><surname>Chaves</surname>, <given-names>R.</given-names></string-name> (<year>2020</year>). <article-title>What don&#8217;t RNN language models learn about filler-gap dependencies?</article-title> <source>Proceedings of the Society for Computation in Linguistics 2020</source>, <fpage>1</fpage>&#8211;<lpage>11</lpage>. <uri>https://aclanthology.org/2020.scil-1.1</uri></mixed-citation></ref>
<ref id="B14"><label>14</label><mixed-citation publication-type="journal"><string-name><surname>Chmielewski</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name><surname>Kucker</surname>, <given-names>S. C.</given-names></string-name> (<year>2020</year>). <article-title>An MTurk crisis? Shifts in data quality and the impact on study results</article-title>. <source>Social Psychological and Personality Science</source>, <volume>11</volume>(<issue>4</issue>), <fpage>464</fpage>&#8211;<lpage>473</lpage>. DOI: <pub-id pub-id-type="doi">10.1177/1948550619875149</pub-id></mixed-citation></ref>
<ref id="B15"><label>15</label><mixed-citation publication-type="webpage"><string-name><surname>Coretta</surname>, <given-names>S.</given-names></string-name> (<year>2022</year>). <source>Tidymv: Tidy model visualisation for generalised additive models</source>. <uri>https://CRAN.R-project.org/package=tidymv</uri></mixed-citation></ref>
<ref id="B16"><label>16</label><mixed-citation publication-type="journal"><string-name><surname>Dai</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Carbonell</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Le</surname>, <given-names>Q. V.</given-names></string-name>, &amp; <string-name><surname>Salakhutdinov</surname>, <given-names>R.</given-names></string-name> (<year>2019</year>). <article-title>Transformer-XL: Attentive language models beyond a fixed-length context</article-title>. <source>arXiv:1901.02860 [Cs, Stat]</source>. DOI: <pub-id pub-id-type="doi">10.18653/v1/P19-1285</pub-id></mixed-citation></ref>
<ref id="B17"><label>17</label><mixed-citation publication-type="journal"><string-name><surname>Demberg</surname>, <given-names>V.</given-names></string-name>, &amp; <string-name><surname>Keller</surname>, <given-names>F.</given-names></string-name> (<year>2008</year>). <article-title>Data from eye-tracking corpora as evidence for theories of syntactic processing complexity</article-title>. <source>Cognition</source>, <volume>109</volume>(<issue>2</issue>), <fpage>193</fpage>&#8211;<lpage>210</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.cognition.2008.07.008</pub-id></mixed-citation></ref>
<ref id="B18"><label>18</label><mixed-citation publication-type="journal"><string-name><surname>Eyal</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>David</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Andrew</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Zak</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name><surname>Ekaterina</surname>, <given-names>D.</given-names></string-name> (<year>2021</year>). <article-title>Data quality of platforms and panels for online behavioral research</article-title>. <source>Behavior Research Methods</source>. DOI: <pub-id pub-id-type="doi">10.3758/s13428-021-01694-3</pub-id></mixed-citation></ref>
<ref id="B19"><label>19</label><mixed-citation publication-type="webpage"><string-name><surname>Fasiolo</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Nedellec</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Goude</surname>, <given-names>Y.</given-names></string-name>, &amp; <string-name><surname>Wood</surname>, <given-names>S. N.</given-names></string-name> (<year>2018</year>). <article-title>Scalable visualisation methods for modern generalized additive models</article-title>. <source>Arxiv Preprint</source>. <uri>https://arxiv.org/abs/1809.10632</uri></mixed-citation></ref>
<ref id="B20"><label>20</label><mixed-citation publication-type="journal"><string-name><surname>Forster</surname>, <given-names>K. I.</given-names></string-name>, <string-name><surname>Guerrera</surname>, <given-names>C.</given-names></string-name>, &amp; <string-name><surname>Elliot</surname>, <given-names>L.</given-names></string-name> (<year>2009</year>). <article-title>The maze task: Measuring forced incremental sentence processing time</article-title>. <source>Behavior Research Methods</source>, <volume>41</volume>(<issue>1</issue>), <fpage>163</fpage>&#8211;<lpage>171</lpage>. DOI: <pub-id pub-id-type="doi">10.3758/BRM.41.1.163</pub-id></mixed-citation></ref>
<ref id="B21"><label>21</label><mixed-citation publication-type="journal"><string-name><surname>Frazier</surname>, <given-names>L.</given-names></string-name>, &amp; <string-name><surname>Rayner</surname>, <given-names>K.</given-names></string-name> (<year>1982</year>). <article-title>Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences</article-title>. <source>Cognitive Psychology</source>, <volume>14</volume>, <fpage>178</fpage>&#8211;<lpage>210</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/0010-0285(82)90008-1</pub-id></mixed-citation></ref>
<ref id="B22"><label>22</label><mixed-citation publication-type="journal"><string-name><surname>Freedman</surname>, <given-names>S. E.</given-names></string-name>, &amp; <string-name><surname>Forster</surname>, <given-names>K. I.</given-names></string-name> (<year>1985</year>). <article-title>The psychological status of overgenerated sentences</article-title>. <source>Cognition</source>, <volume>19</volume>(<issue>2</issue>), <fpage>101</fpage>&#8211;<lpage>131</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/0010-0277(85)90015-0</pub-id></mixed-citation></ref>
<ref id="B23"><label>23</label><mixed-citation publication-type="journal"><string-name><surname>Futrell</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Gibson</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Tily</surname>, <given-names>H. J.</given-names></string-name>, <string-name><surname>Blank</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Vishnevetsky</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Piantadosi</surname>, <given-names>S. T.</given-names></string-name>, &amp; <string-name><surname>Fedorenko</surname>, <given-names>E.</given-names></string-name> (<year>2020</year>). <article-title>The Natural Stories corpus: A reading-time corpus of English texts containing rare syntactic constructions</article-title>. <source>Language Resources &amp; Evaluation</source>. DOI: <pub-id pub-id-type="doi">10.1007/s10579-020-09503-7</pub-id></mixed-citation></ref>
<ref id="B24"><label>24</label><mixed-citation publication-type="webpage"><string-name><surname>Futrell</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Wilcox</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Morita</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Qian</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Ballesteros</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name> (<year>2019</year>). <article-title>Neural language models as psycholinguistic subjects: Representations of syntactic state</article-title>. <source>Proceedings of the 18th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>, <fpage>32</fpage>&#8211;<lpage>42</lpage>. <uri>https://www.aclweb.org/anthology/N19-1004</uri>. DOI: <pub-id pub-id-type="doi">10.18653/v1/N19-1004</pub-id></mixed-citation></ref>
<ref id="B25"><label>25</label><mixed-citation publication-type="journal"><string-name><surname>Gauthier</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wilcox</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Qian</surname>, <given-names>P.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name> (<year>2020</year>). <article-title>SyntaxGym: An online platform for targeted evaluation of language models</article-title>. <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>, <fpage>70</fpage>&#8211;<lpage>76</lpage>. DOI: <pub-id pub-id-type="doi">10.18653/v1/2020.acl-demos.10</pub-id></mixed-citation></ref>
<ref id="B26"><label>26</label><mixed-citation publication-type="journal"><string-name><surname>Goodkind</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name><surname>Bicknell</surname>, <given-names>K.</given-names></string-name> (<year>2018</year>). <article-title>Predictive power of word surprisal for reading times is a linear function of language model quality</article-title>. <source>Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018)</source>, <fpage>10</fpage>&#8211;<lpage>18</lpage>. DOI: <pub-id pub-id-type="doi">10.18653/v1/W18-0102</pub-id></mixed-citation></ref>
<ref id="B27"><label>27</label><mixed-citation publication-type="journal"><string-name><surname>Grodner</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name><surname>Gibson</surname>, <given-names>E.</given-names></string-name> (<year>2005</year>). <article-title>Some consequences of the serial nature of linguistic input</article-title>. <source>Cognitive Science</source>, <volume>29</volume>(<issue>2</issue>), <fpage>261</fpage>&#8211;<lpage>290</lpage>. DOI: <pub-id pub-id-type="doi">10.1207/s15516709cog0000_7</pub-id></mixed-citation></ref>
<ref id="B28"><label>28</label><mixed-citation publication-type="journal"><string-name><surname>Gulordava</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Bojanowski</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Grave</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Linzen</surname>, <given-names>T.</given-names></string-name>, &amp; <string-name><surname>Baroni</surname>, <given-names>M.</given-names></string-name> (<year>2018</year>). <article-title>Colorless green recurrent networks dream hierarchically</article-title>. <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>, <fpage>1195</fpage>&#8211;<lpage>1205</lpage>. DOI: <pub-id pub-id-type="doi">10.18653/v1/N18-1108</pub-id></mixed-citation></ref>
<ref id="B29"><label>29</label><mixed-citation publication-type="webpage"><string-name><surname>Hale</surname>, <given-names>J.</given-names></string-name> (<year>2001</year>). <article-title>A probabilistic Earley parser as a psycholinguistic model</article-title>. <source>Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics</source>, <fpage>159</fpage>&#8211;<lpage>166</lpage>. <uri>http://acl.ldc.upenn.edu/N/N01/N01-1021.pdf</uri>. DOI: <pub-id pub-id-type="doi">10.3115/1073336.1073357</pub-id></mixed-citation></ref>
<ref id="B30"><label>30</label><mixed-citation publication-type="journal"><string-name><surname>Hauser</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Paolacci</surname>, <given-names>G.</given-names></string-name>, &amp; <string-name><surname>Chandler</surname>, <given-names>J. J.</given-names></string-name> (<year>2018</year>). <source>Common concerns with MTurk as a participant pool: Evidence and solutions</source> [Preprint]. PsyArXiv. DOI: <pub-id pub-id-type="doi">10.31234/osf.io/uq45c</pub-id></mixed-citation></ref>
<ref id="B31"><label>31</label><mixed-citation publication-type="journal"><string-name><surname>Hu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Gauthier</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Qian</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Wilcox</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R. P.</given-names></string-name> (<year>2020</year>). <article-title>A systematic assessment of syntactic generalization in neural language models</article-title>. <source>arXiv:2005.03692 [Cs]</source>. DOI: <pub-id pub-id-type="doi">10.18653/v1/2020.acl-main.158</pub-id></mixed-citation></ref>
<ref id="B32"><label>32</label><mixed-citation publication-type="journal"><string-name><surname>Kay</surname>, <given-names>M.</given-names></string-name> (<year>2022</year>). <source>tidybayes: Tidy data and geoms for Bayesian models</source>. DOI: <pub-id pub-id-type="doi">10.5281/zenodo.1308151</pub-id></mixed-citation></ref>
<ref id="B33"><label>33</label><mixed-citation publication-type="journal"><string-name><surname>Kliegl</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Grabner</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Rolfs</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name><surname>Engbert</surname>, <given-names>R.</given-names></string-name> (<year>2004</year>). <article-title>Length, frequency, and predictability effects of words on eye movements in reading</article-title>. <source>European Journal of Cognitive Psychology</source>, <volume>16</volume>(<issue>1&#8211;2</issue>), <fpage>262</fpage>&#8211;<lpage>284</lpage>. DOI: <pub-id pub-id-type="doi">10.1080/09541440340000213</pub-id></mixed-citation></ref>
<ref id="B34"><label>34</label><mixed-citation publication-type="journal"><string-name><surname>Kliegl</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Nuthmann</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name><surname>Engbert</surname>, <given-names>R.</given-names></string-name> (<year>2006</year>). <article-title>Tracking the mind during reading: The influence of past, present, and future words on fixation durations</article-title>. <source>Journal of Experimental Psychology: General</source>, <volume>135</volume>(<issue>1</issue>), <fpage>12</fpage>&#8211;<lpage>35</lpage>. DOI: <pub-id pub-id-type="doi">10.1037/0096-3445.135.1.12</pub-id></mixed-citation></ref>
<ref id="B35"><label>35</label><mixed-citation publication-type="journal"><string-name><surname>Koornneef</surname>, <given-names>A. W.</given-names></string-name>, &amp; <string-name><surname>van Berkum</surname>, <given-names>J. J. A.</given-names></string-name> (<year>2006</year>). <article-title>On the use of verb-based implicit causality in sentence comprehension: Evidence from self-paced reading and eye tracking</article-title>. <source>Journal of Memory and Language</source>, <volume>54</volume>(<issue>4</issue>), <fpage>445</fpage>&#8211;<lpage>465</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.jml.2005.12.003</pub-id></mixed-citation></ref>
<ref id="B36"><label>36</label><mixed-citation publication-type="journal"><string-name><surname>Kutas</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name><surname>Hillyard</surname>, <given-names>S. A.</given-names></string-name> (<year>1980</year>). <article-title>Reading senseless sentences: Brain potentials reflect semantic incongruity</article-title>. <source>Science</source>, <volume>207</volume>(<issue>4427</issue>), <fpage>203</fpage>&#8211;<lpage>205</lpage>. DOI: <pub-id pub-id-type="doi">10.1126/science.7350657</pub-id></mixed-citation></ref>
<ref id="B37"><label>37</label><mixed-citation publication-type="journal"><string-name><surname>Levinson</surname>, <given-names>L.</given-names></string-name> (<year>2022</year>). <source>Beyond surprising: English event structure in the maze</source>. Presentation at the 35th Annual Conference on Human Sentence Processing. DOI: <pub-id pub-id-type="doi">10.3765/elm.2.5384</pub-id></mixed-citation></ref>
<ref id="B38"><label>38</label><mixed-citation publication-type="journal"><string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name> (<year>2008</year>). <article-title>Expectation-based syntactic comprehension</article-title>. <source>Cognition</source>, <volume>106</volume>(<issue>3</issue>), <fpage>1126</fpage>&#8211;<lpage>1177</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.cognition.2007.05.006</pub-id></mixed-citation></ref>
<ref id="B39"><label>39</label><mixed-citation publication-type="book"><string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name> (<year>2013</year>). <chapter-title>Memory and surprisal in human sentence comprehension</chapter-title>. In <string-name><given-names>R. P. G.</given-names> <surname>van Gompel</surname></string-name> (Ed.), <source>Sentence processing</source> (pp. <fpage>78</fpage>&#8211;<lpage>114</lpage>). <publisher-name>Psychology Press</publisher-name>.</mixed-citation></ref>
<ref id="B40"><label>40</label><mixed-citation publication-type="journal"><string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Bicknell</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Slattery</surname>, <given-names>T.</given-names></string-name>, &amp; <string-name><surname>Rayner</surname>, <given-names>K.</given-names></string-name> (<year>2009</year>). <article-title>Eye movement evidence that readers maintain and act on uncertainty about past linguistic input</article-title>. <source>Proceedings of the National Academy of Sciences</source>, <volume>106</volume>(<issue>50</issue>), <fpage>21086</fpage>&#8211;<lpage>21090</lpage>. DOI: <pub-id pub-id-type="doi">10.1073/pnas.0907664106</pub-id></mixed-citation></ref>
<ref id="B41"><label>41</label><mixed-citation publication-type="journal"><string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Fedorenko</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name><surname>Gibson</surname>, <given-names>E.</given-names></string-name> (<year>2013</year>). <article-title>The syntactic complexity of Russian relative clauses</article-title>. <source>Journal of Memory and Language</source>, <volume>69</volume>(<issue>4</issue>), <fpage>461</fpage>&#8211;<lpage>495</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.jml.2012.10.005</pub-id></mixed-citation></ref>
<ref id="B42"><label>42</label><mixed-citation publication-type="webpage"><string-name><surname>Lewis</surname>, <given-names>R. L.</given-names></string-name>, <string-name><surname>Vasishth</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name><surname>Van Dyke</surname>, <given-names>J.</given-names></string-name> (<year>2006</year>). <article-title>Computational principles of working memory in sentence comprehension</article-title>. <source>Trends in Cognitive Sciences</source>, <volume>10</volume>(<issue>10</issue>), <fpage>447</fpage>&#8211;<lpage>454</lpage>. <uri>http://ling.ucsd.edu/&#126;rlevy/lign274/papers/lewis-etal-2006.pdf</uri>. DOI: <pub-id pub-id-type="doi">10.1016/j.tics.2006.08.007</pub-id></mixed-citation></ref>
<ref id="B43"><label>43</label><mixed-citation publication-type="journal"><string-name><surname>Lieburg</surname>, <given-names>R. van</given-names></string-name>, <string-name><surname>Hartsuiker</surname>, <given-names>R. J.</given-names></string-name>, &amp; <string-name><surname>Bernolet</surname>, <given-names>S.</given-names></string-name> (<year>2022</year>). <source>Using the Maze task paradigm to test structural priming in comprehension in L1 and L2 speakers of English</source>. Presentation at the 35th Annual Conference on Human Sentence Processing.</mixed-citation></ref>
<ref id="B44"><label>44</label><mixed-citation publication-type="journal"><string-name><surname>Linzen</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Dupoux</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name><surname>Goldberg</surname>, <given-names>Y.</given-names></string-name> (<year>2016</year>). <article-title>Assessing the ability of LSTMs to learn syntax-sensitive dependencies</article-title>. <source>Transactions of the Association for Computational Linguistics</source>. DOI: <pub-id pub-id-type="doi">10.1162/tacl_a_00115</pub-id></mixed-citation></ref>
<ref id="B45"><label>45</label><mixed-citation publication-type="journal"><string-name><surname>Luke</surname>, <given-names>S. G.</given-names></string-name>, &amp; <string-name><surname>Christianson</surname>, <given-names>K.</given-names></string-name> (<year>2016</year>). <article-title>Limits on lexical prediction during reading</article-title>. <source>Cognitive Psychology</source>, <volume>88</volume>, <fpage>22</fpage>&#8211;<lpage>60</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.cogpsych.2016.06.002</pub-id></mixed-citation></ref>
<ref id="B46"><label>46</label><mixed-citation publication-type="journal"><string-name><surname>MacDonald</surname>, <given-names>M. C.</given-names></string-name> (<year>1993</year>). <article-title>The interaction of lexical and syntactic ambiguity</article-title>. <source>Journal of Memory and Language</source>, <volume>32</volume>, <fpage>692</fpage>&#8211;<lpage>715</lpage>. DOI: <pub-id pub-id-type="doi">10.1006/jmla.1993.1035</pub-id></mixed-citation></ref>
<ref id="B47"><label>47</label><mixed-citation publication-type="journal"><string-name><surname>Marslen-Wilson</surname>, <given-names>W.</given-names></string-name> (<year>1975</year>). <article-title>Sentence perception as an interactive parallel process</article-title>. <source>Science</source>, <volume>189</volume>(<issue>4198</issue>), <fpage>226</fpage>&#8211;<lpage>228</lpage>. DOI: <pub-id pub-id-type="doi">10.1126/science.189.4198.226</pub-id></mixed-citation></ref>
<ref id="B48"><label>48</label><mixed-citation publication-type="journal"><string-name><surname>Marvin</surname>, <given-names>R.</given-names></string-name>, &amp; <string-name><surname>Linzen</surname>, <given-names>T.</given-names></string-name> (<year>2018</year>). <article-title>Targeted syntactic evaluation of language models</article-title>. <source>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>, <fpage>1192</fpage>&#8211;<lpage>1202</lpage>. DOI: <pub-id pub-id-type="doi">10.18653/v1/D18-1151</pub-id></mixed-citation></ref>
<ref id="B49"><label>49</label><mixed-citation publication-type="journal"><string-name><surname>McCoy</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Pavlick</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name><surname>Linzen</surname>, <given-names>T.</given-names></string-name> (<year>2019</year>). <article-title>Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference</article-title>. <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>, <fpage>3428</fpage>&#8211;<lpage>3448</lpage>. DOI: <pub-id pub-id-type="doi">10.18653/v1/P19-1334</pub-id></mixed-citation></ref>
<ref id="B50"><label>50</label><mixed-citation publication-type="book"><string-name><surname>Mitchell</surname>, <given-names>D. C.</given-names></string-name> (<year>1984</year>). <chapter-title>An evaluation of subject-paced reading tasks and other methods for investigating immediate processes in reading</chapter-title>. In <string-name><given-names>D.</given-names> <surname>Kieras</surname></string-name> &amp; <string-name><given-names>M. A.</given-names> <surname>Just</surname></string-name> (Eds.), <source>New methods in reading comprehension</source>. <publisher-name>Erlbaum</publisher-name>.</mixed-citation></ref>
<ref id="B51"><label>51</label><mixed-citation publication-type="book"><string-name><surname>Mitchell</surname>, <given-names>D. C.</given-names></string-name> (<year>2004</year>). <chapter-title>On-line methods in language processing: Introduction and historical review</chapter-title>. In <string-name><given-names>M.</given-names> <surname>Carreiras</surname></string-name> &amp; <string-name><given-names>C.</given-names> <surname>Clifton</surname> <suffix>Jr.</suffix></string-name> (Eds.), <source>The on-line study of sentence comprehension: Eye-tracking, ERP and beyond</source> (pp. <fpage>15</fpage>&#8211;<lpage>32</lpage>). <publisher-name>Routledge</publisher-name>.</mixed-citation></ref>
<ref id="B52"><label>52</label><mixed-citation publication-type="webpage"><string-name><surname>M&#252;ller</surname>, <given-names>K.</given-names></string-name> (<year>2020</year>). <source>Here: A simpler way to find your files</source>. <uri>https://CRAN.R-project.org/package=here</uri></mixed-citation></ref>
<ref id="B53"><label>53</label><mixed-citation publication-type="journal"><string-name><surname>Orth</surname>, <given-names>W.</given-names></string-name>, &amp; <string-name><surname>Yoshida</surname>, <given-names>M.</given-names></string-name> (<year>2022</year>). <article-title>Processing profile for quantifiers in verb phrase ellipsis: Evidence for grammatical economy</article-title>. <source>Proceedings of the Linguistic Society of America</source>, <volume>7</volume>(<issue>1</issue>), <fpage>5210</fpage>. DOI: <pub-id pub-id-type="doi">10.3765/plsa.v7i1.5210</pub-id></mixed-citation></ref>
<ref id="B54"><label>54</label><mixed-citation publication-type="journal"><string-name><surname>Osterhout</surname>, <given-names>L.</given-names></string-name>, &amp; <string-name><surname>Holcomb</surname>, <given-names>P.</given-names></string-name> (<year>1992</year>). <article-title>Event-related brain potentials elicited by syntactic anomaly</article-title>. <source>Journal of Memory and Language</source>, <volume>31</volume>(<issue>6</issue>), <fpage>785</fpage>&#8211;<lpage>806</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/0749-596X(92)90039-Z</pub-id></mixed-citation></ref>
<ref id="B55"><label>55</label><mixed-citation publication-type="webpage"><string-name><surname>Pedersen</surname>, <given-names>T. L.</given-names></string-name> (<year>2022</year>). <source>Patchwork: The composer of plots</source>. <uri>https://CRAN.R-project.org/package=patchwork</uri></mixed-citation></ref>
<ref id="B56"><label>56</label><mixed-citation publication-type="journal"><string-name><surname>Peer</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Brandimarte</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Samat</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name><surname>Acquisti</surname>, <given-names>A.</given-names></string-name> (<year>2017</year>). <article-title>Beyond the Turk: Alternative platforms for crowdsourcing behavioral research</article-title>. <source>Journal of Experimental Social Psychology</source>, <volume>70</volume>, <fpage>153</fpage>&#8211;<lpage>163</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.jesp.2017.01.006</pub-id></mixed-citation></ref>
<ref id="B57"><label>57</label><mixed-citation publication-type="webpage"><collab>R Core Team</collab>. (<year>2022</year>). <source>R: A language and environment for statistical computing</source>. <publisher-name>R Foundation for Statistical Computing</publisher-name>. <uri>https://www.R-project.org/</uri></mixed-citation></ref>
<ref id="B58"><label>58</label><mixed-citation publication-type="webpage"><string-name><surname>Radford</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Child</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Luan</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Amodei</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name><surname>Sutskever</surname>, <given-names>I.</given-names></string-name> (<year>2019</year>). <source>Language models are unsupervised multitask learners</source>. <publisher-name>OpenAI</publisher-name>.</mixed-citation></ref>
<ref id="B59"><label>59</label><mixed-citation publication-type="journal"><string-name><surname>Rayner</surname>, <given-names>K.</given-names></string-name> (<year>1998</year>). <article-title>Eye movements in reading and information processing: 20 years of research</article-title>. <source>Psychological Bulletin</source>, <volume>124</volume>(<issue>3</issue>), <fpage>372</fpage>&#8211;<lpage>422</lpage>. DOI: <pub-id pub-id-type="doi">10.1037/0033-2909.124.3.372</pub-id></mixed-citation></ref>
<ref id="B60"><label>60</label><mixed-citation publication-type="journal"><string-name><surname>Rayner</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Ashby</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Pollatsek</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name><surname>Reichle</surname>, <given-names>E. D.</given-names></string-name> (<year>2004</year>). <article-title>The effects of frequency and predictability on eye fixations in reading: Implications for the E-Z Reader model</article-title>. <source>Journal of Experimental Psychology: Human Perception and Performance</source>, <volume>30</volume>(<issue>4</issue>), <fpage>720</fpage>&#8211;<lpage>732</lpage>. DOI: <pub-id pub-id-type="doi">10.1037/0096-1523.30.4.720</pub-id></mixed-citation></ref>
<ref id="B61"><label>61</label><mixed-citation publication-type="journal"><string-name><surname>Shain</surname>, <given-names>C.</given-names></string-name> (<year>2019</year>). <article-title>A large-scale study of the effects of word frequency and predictability in naturalistic reading</article-title>. <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>, <fpage>4086</fpage>&#8211;<lpage>4094</lpage>. DOI: <pub-id pub-id-type="doi">10.18653/v1/N19-1413</pub-id></mixed-citation></ref>
<ref id="B62"><label>62</label><mixed-citation publication-type="journal"><string-name><surname>Shain</surname>, <given-names>C.</given-names></string-name>, &amp; <string-name><surname>Schuler</surname>, <given-names>W.</given-names></string-name> (<year>2018</year>). <article-title>Deconvolutional time series regression: A technique for modeling temporally diffuse effects</article-title>. <source>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>, <fpage>2679</fpage>&#8211;<lpage>2689</lpage>. DOI: <pub-id pub-id-type="doi">10.18653/v1/D18-1288</pub-id></mixed-citation></ref>
<ref id="B63"><label>63</label><mixed-citation publication-type="confproc"><string-name><surname>Sloggett</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Van Handel</surname>, <given-names>N.</given-names></string-name>, &amp; <string-name><surname>Rysling</surname>, <given-names>A.</given-names></string-name> (<year>2020</year>). <source>A-maze by any other name</source>. Presentation at the CUNY Conference on Human Sentence Processing.</mixed-citation></ref>
<ref id="B64"><label>64</label><mixed-citation publication-type="journal"><string-name><surname>Smith</surname>, <given-names>N. J.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name> (<year>2013</year>). <article-title>The effect of word predictability on reading time is logarithmic</article-title>. <source>Cognition</source>, <volume>128</volume>(<issue>3</issue>), <fpage>302</fpage>&#8211;<lpage>319</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.cognition.2013.02.013</pub-id></mixed-citation></ref>
<ref id="B65"><label>65</label><mixed-citation publication-type="journal"><string-name><surname>Smith</surname>, <given-names>N. J.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name> (<year>2008</year>). <article-title>Optimal processing times in reading: A formal model and empirical investigation</article-title>. <source>Proceedings of the 30th Annual Meeting of the Cognitive Science Society</source>, <fpage>595</fpage>&#8211;<lpage>600</lpage>.</mixed-citation></ref>
<ref id="B66"><label>66</label><mixed-citation publication-type="journal"><string-name><surname>Staub</surname>, <given-names>A.</given-names></string-name> (<year>2010</year>). <article-title>Eye movements and processing difficulty in object relative clauses</article-title>. <source>Cognition</source>, <volume>116</volume>, <fpage>71</fpage>&#8211;<lpage>86</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.cognition.2010.04.002</pub-id></mixed-citation></ref>
<ref id="B67"><label>67</label><mixed-citation publication-type="journal"><string-name><surname>Traxler</surname>, <given-names>M. J.</given-names></string-name>, <string-name><surname>Morris</surname>, <given-names>R. K.</given-names></string-name>, &amp; <string-name><surname>Seely</surname>, <given-names>R. E.</given-names></string-name> (<year>2002</year>). <article-title>Processing subject and object relative clauses: Evidence from eye movements</article-title>. <source>Journal of Memory and Language</source>, <volume>47</volume>, <fpage>69</fpage>&#8211;<lpage>90</lpage>. DOI: <pub-id pub-id-type="doi">10.1006/jmla.2001.2836</pub-id></mixed-citation></ref>
<ref id="B68"><label>68</label><mixed-citation publication-type="journal"><string-name><surname>Ungerer</surname>, <given-names>T.</given-names></string-name> (<year>2021</year>). <article-title>Using structural priming to test links between constructions: English caused-motion and resultative sentences inhibit each other</article-title>. <source>Cognitive Linguistics</source>, <volume>32</volume>(<issue>3</issue>), <fpage>389</fpage>&#8211;<lpage>420</lpage>. DOI: <pub-id pub-id-type="doi">10.1515/cog-2020-0016</pub-id></mixed-citation></ref>
<ref id="B69"><label>69</label><mixed-citation publication-type="journal"><string-name><surname>Van Schijndel</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name><surname>Linzen</surname>, <given-names>T.</given-names></string-name> (<year>2021</year>). <article-title>Single-stage prediction models do not explain the magnitude of syntactic disambiguation difficulty</article-title>. <source>Cognitive Science</source>, <volume>45</volume>(<issue>6</issue>), <elocation-id>e12988</elocation-id>. DOI: <pub-id pub-id-type="doi">10.1111/cogs.12988</pub-id></mixed-citation></ref>
<ref id="B70"><label>70</label><mixed-citation publication-type="journal"><string-name><surname>Vani</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Wilcox</surname>, <given-names>E. G.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name> (<year>2021</year>). <article-title>Using the interpolated Maze task to assess incremental processing in English relative clauses</article-title>. <source>Proceedings of the Annual Meeting of the Cognitive Science Society</source>, <volume>43</volume>(<issue>43</issue>).</mixed-citation></ref>
<ref id="B71"><label>71</label><mixed-citation publication-type="journal"><string-name><surname>Warstadt</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Parrish</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Mohananey</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Peng</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.-F.</given-names></string-name>, &amp; <string-name><surname>Bowman</surname>, <given-names>S. R.</given-names></string-name> (<year>2020</year>). <article-title>BLiMP: The benchmark of linguistic minimal pairs for English</article-title>. <source>Transactions of the Association for Computational Linguistics</source>, <volume>8</volume>, <fpage>377</fpage>&#8211;<lpage>392</lpage>. DOI: <pub-id pub-id-type="doi">10.1162/tacl_a_00321</pub-id></mixed-citation></ref>
<ref id="B72"><label>72</label><mixed-citation publication-type="journal"><string-name><surname>Wickham</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Averick</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Bryan</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Chang</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>McGowan</surname>, <given-names>L. D.</given-names></string-name>, <string-name><surname>Fran&#231;ois</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Grolemund</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Hayes</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Henry</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Hester</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Kuhn</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Pedersen</surname>, <given-names>T. L.</given-names></string-name>, <string-name><surname>Miller</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Bache</surname>, <given-names>S. M.</given-names></string-name>, <string-name><surname>M&#252;ller</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Ooms</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Robinson</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Seidel</surname>, <given-names>D. P.</given-names></string-name>, <string-name><surname>Spinu</surname>, <given-names>V.</given-names></string-name>, &#8230; <string-name><surname>Yutani</surname>, <given-names>H.</given-names></string-name> (<year>2019</year>). <article-title>Welcome to the tidyverse</article-title>. <source>Journal of Open Source Software</source>, <volume>4</volume>(<issue>43</issue>), <fpage>1686</fpage>. DOI: <pub-id pub-id-type="doi">10.21105/joss.01686</pub-id></mixed-citation></ref>
<ref id="B73"><label>73</label><mixed-citation publication-type="journal"><string-name><surname>Wilcox</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Futrell</surname>, <given-names>R.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R. P.</given-names></string-name> (In press). <article-title>Using computational models to test syntactic learnability</article-title>. <source>Linguistic Inquiry</source>.</mixed-citation></ref>
<ref id="B74"><label>74</label><mixed-citation publication-type="journal"><string-name><surname>Wilcox</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Gauthier</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Qian</surname>, <given-names>P.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name> (<year>2020</year>). <article-title>On the predictive power of neural language models for human real-time comprehension behavior</article-title>. <source>arXiv:2006.01912 [Cs]</source>. DOI: <pub-id pub-id-type="doi">10.48550/arXiv.2006.01912</pub-id></mixed-citation></ref>
<ref id="B75"><label>75</label><mixed-citation publication-type="journal"><string-name><surname>Wilcox</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Vani</surname>, <given-names>P.</given-names></string-name>, &amp; <string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name> (<year>2021</year>). <article-title>A targeted assessment of incremental processing in neural language models and humans</article-title>. <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>, <fpage>939</fpage>&#8211;<lpage>952</lpage>. DOI: <pub-id pub-id-type="doi">10.18653/v1/2021.acl-long.76</pub-id></mixed-citation></ref>
<ref id="B76"><label>76</label><mixed-citation publication-type="webpage"><string-name><surname>Wilke</surname>, <given-names>C. O.</given-names></string-name> (<year>2020</year>). <source>Cowplot: Streamlined plot theme and plot annotations for &#8216;ggplot2&#8217;</source>. <uri>https://CRAN.R-project.org/package=cowplot</uri></mixed-citation></ref>
<ref id="B77"><label>77</label><mixed-citation publication-type="journal"><string-name><surname>Witzel</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Witzel</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name><surname>Forster</surname>, <given-names>K.</given-names></string-name> (<year>2012</year>). <article-title>Comparisons of online reading paradigms: Eye tracking, moving-window, and maze</article-title>. <source>Journal of Psycholinguistic Research</source>, <volume>41</volume>(<issue>2</issue>), <fpage>105</fpage>&#8211;<lpage>128</lpage>. DOI: <pub-id pub-id-type="doi">10.1007/s10936-011-9179-x</pub-id></mixed-citation></ref>
<ref id="B78"><label>78</label><mixed-citation publication-type="journal"><string-name><surname>Wood</surname>, <given-names>S.</given-names></string-name> (<year>2003</year>). <article-title>Thin-plate regression splines</article-title>. <source>Journal of the Royal Statistical Society (B)</source>, <volume>65</volume>(<issue>1</issue>), <fpage>95</fpage>&#8211;<lpage>114</lpage>. DOI: <pub-id pub-id-type="doi">10.1111/1467-9868.00374</pub-id></mixed-citation></ref>
<ref id="B79"><label>79</label><mixed-citation publication-type="journal"><string-name><surname>Wood</surname>, <given-names>S.</given-names></string-name> (<year>2004</year>). <article-title>Stable and efficient multiple smoothing parameter estimation for generalized additive models</article-title>. <source>Journal of the American Statistical Association</source>, <volume>99</volume>(<issue>467</issue>), <fpage>673</fpage>&#8211;<lpage>686</lpage>. DOI: <pub-id pub-id-type="doi">10.1198/016214504000000980</pub-id></mixed-citation></ref>
<ref id="B80"><label>80</label><mixed-citation publication-type="journal"><string-name><surname>Wood</surname>, <given-names>S.</given-names></string-name> (<year>2011</year>). <article-title>Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models</article-title>. <source>Journal of the Royal Statistical Society (B)</source>, <volume>73</volume>(<issue>1</issue>), <fpage>3</fpage>&#8211;<lpage>36</lpage>. DOI: <pub-id pub-id-type="doi">10.1111/j.1467-9868.2010.00749.x</pub-id></mixed-citation></ref>
<ref id="B81"><label>81</label><mixed-citation publication-type="book"><string-name><surname>Wood</surname>, <given-names>S.</given-names></string-name> (<year>2017</year>). <source>Generalized additive models: An introduction with R</source> (<edition>2nd</edition> ed.). <publisher-name>CRC</publisher-name>. DOI: <pub-id pub-id-type="doi">10.1201/9781315370279</pub-id></mixed-citation></ref>
<ref id="B82"><label>82</label><mixed-citation publication-type="journal"><string-name><surname>Wood</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Pya</surname>, <given-names>N.</given-names></string-name>, &amp; <string-name><surname>S&#228;fken</surname>, <given-names>B.</given-names></string-name> (<year>2016</year>). <article-title>Smoothing parameter and model selection for general smooth models (with discussion)</article-title>. <source>Journal of the American Statistical Association</source>, <volume>111</volume>, <fpage>1548</fpage>&#8211;<lpage>1575</lpage>. DOI: <pub-id pub-id-type="doi">10.1080/01621459.2016.1180986</pub-id></mixed-citation></ref>
<ref id="B83"><label>83</label><mixed-citation publication-type="webpage"><string-name><surname>Xie</surname>, <given-names>Y.</given-names></string-name> (<year>2016</year>). <source>Bookdown: Authoring books and technical documents with R markdown</source>. <publisher-name>Chapman and Hall/CRC</publisher-name>. <uri>https://bookdown.org/yihui/bookdown</uri>. DOI: <pub-id pub-id-type="doi">10.1201/9781315204963</pub-id></mixed-citation></ref>
<ref id="B84"><label>84</label><mixed-citation publication-type="webpage"><string-name><surname>Zhu</surname>, <given-names>H.</given-names></string-name> (<year>2021</year>). <source>kableExtra: Construct complex table with &#8216;kable&#8217; and pipe syntax</source>. <uri>https://CRAN.R-project.org/package=kableExtra</uri></mixed-citation></ref>
</ref-list>
</back>
</article>