
Using the Penn Parsed Corpora of Historical English with CorpusSearch
Waseda Workshop on the PPCHE
Anthony Kroch & Beatrice Santorini
University of Pennsylvania
December 12, 2017

Slides for this workshop
www.ling.upenn.edu/~kroch/handouts/
CorpusSearch user’s ts.html
Annotation manual for PPCHE
www.ling.upenn.edu/~beatrice/annotation/

What is a morphosyntactically annotated corpus?

morphological tagging
    case, gender, number features on nouns
    tense, mood, aspect features on verbs, etc.
lemmatization
    word sense disambiguation
    spelling normalization
part of speech tagging
    elementary syntactic functions
syntactic parsing
    hierarchical structure of phrases/clauses
    grammatical function of phrases/clauses

An example sentence
( (IP-MAT (NP-SBJ (PRO They))
          (HVP have)
          (NP-OB1 (D a)
                  (ADJ native)
                  (N justice)
                  (, ,)
                  (CP-REL (WNP-1 (WPRO which))
                          (C 0)
                          (IP-SUB (NP-SBJ *T*-1)
                                  (VBP knows)
                                  (NP-OB1 (Q no)
                                          (N fraud)))))
          (. ;))
  (ID BEHN-E3-P1,150.48))
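As a rough illustration of how such a labeled bracketing can be read by a program, here is a minimal Python sketch that parses the example sentence into nested lists and lists its terminals. It is not part of CorpusSearch; the function names `parse_tree` and `leaves` are hypothetical, and the filtering of empty categories (`0`, `*T*-1`), punctuation, and the ID node is an assumption made only for this example.

```python
import re

def parse_tree(text):
    """Parse a Penn-style bracketed string into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # consume "("
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())   # nested constituent
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return node

    return read()

def leaves(node):
    """Yield (tag, text) for each terminal, i.e. each [tag, text] pair of strings."""
    if len(node) == 2 and all(isinstance(x, str) for x in node):
        yield node[0], node[1]
    else:
        for child in node:
            if isinstance(child, list):
                yield from leaves(child)

SENTENCE = """( (IP-MAT (NP-SBJ (PRO They))
          (HVP have)
          (NP-OB1 (D a) (ADJ native) (N justice) (, ,)
                  (CP-REL (WNP-1 (WPRO which))
                          (C 0)
                          (IP-SUB (NP-SBJ *T*-1)
                                  (VBP knows)
                                  (NP-OB1 (Q no) (N fraud)))))
          (. ;))
  (ID BEHN-E3-P1,150.48))"""

tree = parse_tree(SENTENCE)
terminals = list(leaves(tree))

# Assumption for this sketch: drop the ID node, punctuation tags,
# and empty categories ("0" and traces beginning with "*").
words = [w for tag, w in terminals
         if tag not in {"ID", ",", "."} and w != "0" and not w.startswith("*")]
print(" ".join(words))  # They have a native justice which knows no fraud
```

The nested-list representation keeps the label as the first element of every constituent, so the hierarchical structure and the grammatical function tags (NP-SBJ, NP-OB1) both remain accessible.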

Reformatted sentence for lemmatization
( (IP-MAT (NP-SBJ         (PRO They))
                          (HVP have)
          (NP-OB1         (D a)
                          (ADJ native)
                          (N justice)
                          (, ,)
          (CP-REL (WNP-1  (WPRO which))
                          (C 0)
          (IP-SUB (NP-SBJ *T*-1)
                          (VBP knows)
          (NP-OB1         (Q no)
                          (N fraud)))))
                          (. ;))
  (ID BEHN-E3-P1,150.48))

The example with lemmatization
( (IP-MAT (NP-SBJ (PRO (ORTHO They)
                       (METAWORD (LEMMA (HEADWORD they) (OEDID 200700)))))
          (HVP (ORTHO have)
               (METAWORD (LEMMA (HEADWORD have) (OEDID 84705))))
          (NP-OB1 (D (ORTHO a)
                     (METAWORD (LEMMA (HEADWORD a) (OEDID 4))))
                  (ADJ (ORTHO native)
                       (METAWORD (LEMMA (HEADWORD native) (OEDID 125304))))
                  (N (ORTHO justice)
                     (METAWORD (LEMMA (HEADWORD justice) (OEDID 102198))))
                  (, (ORTHO ,)
                     (METAWORD (LEMMA (HEADWORD ,) (OEDID NA))))
                  (CP-REL (WNP-1 (WPRO (ORTHO which)
                                       (METAWORD (LEMMA (HEADWORD which) (OEDID 228284)))))
                          (C (METAWORD (ALT-ORTHO 0) (LEMMA 0)))
                          (IP-SUB (NP-SBJ (METAWORD (ALT-ORTHO *T*-1) (LEMMA 0)))
                                  (VBP (ORTHO knows)
                                       (METAWORD (LEMMA (HEADWORD know) (OEDID 104157))))
                                  (NP-OB1 (Q (ORTHO no)
                                             (METAWORD (LEMMA (HEADWORD no) (OEDID 127437))))
                                          (N (ORTHO fraud)
                                             (METAWORD (LEMMA (HEADWORD fraud) (OEDID 74298))))))))
          (. (ORTHO ;)
             (METAWORD (LEMMA (HEADWORD ;) (OEDID NA)))))
  (ID BEHN-E3-P1,150.48))


The annotation task
Annotation is multilevel and complex, so that using human effort for the whole job is impractical.
At the same time, accuracy is crucial and unattainable at present with fully automated methods.
In consequence, parsed corpora are built by interleaving automated analysis with human correction of the output.

Annotation software
A wide range of software for automatic part-of-speech tagging, and other software for automatic parsing.
Software for correcting the errors of automated taggers.
Annotald software for correcting the errors of automatic parsers (annotald.github.io).
CorpusSearch revision queries for semi-automatic parsing and parsing correction.

Available parsed corpus resources for European languages using the Penn annotation scheme

English Parsed Corpora, I
Anthony Kroch and Ann Taylor. Penn-Helsinki Parsed Corpus of Middle English, second edition. University of Pennsylvania, 2000. (http://www.ling.upenn.edu/hist-corpora) 1.3 million words
Anthony Kroch, Beatrice Santorini, and Ariel Diertani. Penn-Helsinki Parsed Corpus of Early Modern English. University of Pennsylvania, 2004. 1.8 million words
Anthony Kroch, Beatrice Santorini, and Ariel Diertani. Penn Parsed Corpus of Modern British English. University of Pennsylvania, 2010. 3.0 million words

English Parsed Corpora, II
Ann Taylor, Anthony Warner, Susan Pintzuk, and Frank Beths. York-Toronto-Helsinki Parsed Corpus of Old English Prose, first edition. Oxford Text Archive, 2003. (http://www-users.york.ac.uk/~lang22/YCOE/YcoeHome.htm) 1.5 million words
Ann Taylor, Arja Nurmi, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Parsed Corpus of Early English Correspondence, first edition. Oxford Text Archive, 2006. 2.2 million words

A sample of other languages, I
Eiríkur Rögnvaldsson et al. Icelandic Parsed Historical Corpus (IcePaHC), version 0.9, 8/2011. (http://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)) 1 million words
France Martineau et al. MCVF Corpus of Historical French. University of Ottawa, 2010. (http://www.arts.uottawa.ca/voies/) 1 million words
Charlotte Galves et al. Tycho Brahe Corpus of Historical Portuguese. University of Campinas, São Paulo, Brazil, 2010. (http://www.tycho.iel.unicamp.br/~tycho/corpus/en/) 2 million words, 0.8 million parsed to date

Other languages, II
Prashant Pardeshi et al. NINJAL Parsed Corpus of Modern Japanese (NPCMJ) & Keyaki Treebank. 500K words
Christina Tortora et al. The Audio-Aligned and Parsed Corpus of Appalachian English (AAPCAppE). 1 million words
Christina Tortora et al. A Corpus of New York City English (CUNY-CoNYCE). Multi-million word corpus under construction

Coding queries

A canonical order sentence
( (IP-MAT (NP-SBJ (NPR John))
          (VBP likes)
          (NP-OB1 (N pizza))
          (PUNC .)))

A topicalized sentence
( (IP-MAT (NP-OB1 (N Pizza))
          (PUNC ,)
          (NP-SBJ (NPR John))
          (VBP likes)
          (PUNC .)))

A verb-second (V2) sentence
( (IP-MAT (NP-OB1 (N Pizza))
          (VBP likes)
          (NP-SBJ (NPR John))
          (PUNC .)))

A coding query example
node: IP-MAT*
ignore_nodes: PUNC \**
coding_query:
// grammatical status of first constituent coded in column 1
1: {
    subj: (IP-MAT* iDomsFirst NP-SBJ*)
    obj: (IP-MAT* iDomsFirst NP-OB1*)
    temp: (IP-MAT* iDomsFirst *-TMP)
    -: ELSE
}

Coding query, column 2
// position of finite verb
2: {
    \1: (IP-MAT* iDomsNum 1 finite_verb)
    \2: (IP-MAT* iDomsNum 2 finite_verb)
    \3: (IP-MAT* iDomsNum 3 finite_verb)
    -: ELSE
}

Coding query, column 3
// subject-verb inversion?
3: {
    subj-fin: (IP-MAT* iDoms NP-SBJ*)
        AND (IP-MAT* iDoms finite_verb)
        AND (NP-SBJ* precedes finite_verb)
    fin-subj: (IP-MAT* iDoms NP-SBJ*)
        AND (IP-MAT* iDoms finite_verb)
        AND (finite_verb precedes NP-SBJ*)
    -: ELSE
}

Coding query, column 4
// status of subject
4: {
    pron: (IP-MAT* iDoms NP-SBJ*)
        AND (NP-SBJ* iDomsOnly PRO)
    np: (IP-MAT* iDoms NP-SBJ*)
    -: ELSE
}

The canonical order sentence, coded
( (IP-MAT (CODING-IP-MAT subj : 2 : subj-fin : np)
          (NP-SBJ (NPR John))
          (VBP likes)
          (NP-OB1 (N pizza))
          (PUNC .)))

The topicalized sentence, coded
( (IP-MAT (CODING-IP-MAT obj : 3 : subj-fin : np)
          (NP-OB1 (N Pizza))
          (PUNC ,)
          (NP-SBJ (NPR John))
          (VBP likes)
          (PUNC .)))

The V2 sentence, coded
( (IP-MAT (CODING-IP-MAT obj : 2 : fin-subj : np)
          (NP-OB1 (N Pizza))
          (VBP likes)
          (NP-SBJ (NPR John))
          (PUNC .)))
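To make the logic of the four coding columns concrete, the following Python sketch re-implements them for the three toy sentences and reproduces the coding strings shown on the coded slides. It only emulates the idea; it is not how CorpusSearch evaluates queries. The names `parse_tree`, `code_ip`, and `FINITE` are hypothetical, and the set of tags treated as finite verbs is an assumption standing in for whatever definition the finite verb argument refers to.

```python
import re

def parse_tree(text):
    """Parse a Penn-style bracketed string into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # consume "("
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return node

    return read()

# Assumption: these POS tags count as finite verbs for the toy examples.
FINITE = {"VBP", "VBD", "MD", "HVP", "HVD", "BEP", "BED", "DOP", "DOD"}

def code_ip(ip, ignore=("PUNC",)):
    """Return a four-column coding string for one IP-MAT node."""
    kids = [k for k in ip[1:] if isinstance(k, list) and k[0] not in ignore]
    labels = [k[0] for k in kids]

    # Column 1: grammatical status of the first constituent.
    first = labels[0]
    col1 = ("subj" if first.startswith("NP-SBJ") else
            "obj" if first.startswith("NP-OB1") else
            "temp" if first.endswith("-TMP") else "-")

    fin = next((i for i, l in enumerate(labels) if l in FINITE), None)
    sbj = next((i for i, l in enumerate(labels) if l.startswith("NP-SBJ")), None)

    # Column 2: position of the finite verb (only positions 1 to 3 are coded).
    col2 = str(fin + 1) if fin is not None and fin < 3 else "-"

    # Column 3: relative order of subject and finite verb.
    if fin is None or sbj is None:
        col3 = "-"
    else:
        col3 = "subj-fin" if sbj < fin else "fin-subj"

    # Column 4: pronominal subject (NP-SBJ dominating only PRO) vs. other NP.
    if sbj is None:
        col4 = "-"
    else:
        daughters = [k for k in kids[sbj][1:] if isinstance(k, list)]
        col4 = "pron" if len(daughters) == 1 and daughters[0][0] == "PRO" else "np"

    return " : ".join([col1, col2, col3, col4])

TOY = [
    "( (IP-MAT (NP-SBJ (NPR John)) (VBP likes) (NP-OB1 (N pizza)) (PUNC .)))",
    "( (IP-MAT (NP-OB1 (N Pizza)) (PUNC ,) (NP-SBJ (NPR John)) (VBP likes) (PUNC .)))",
    "( (IP-MAT (NP-OB1 (N Pizza)) (VBP likes) (NP-SBJ (NPR John)) (PUNC .)))",
]

codes = [code_ip(parse_tree(s)[0]) for s in TOY]
for c in codes:
    print(c)
```

Note that ignoring PUNC matters for column 2: in the topicalized sentence the comma is skipped, so the finite verb counts as the third constituent.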

Extracting coding strings for quantitative analysis
Run a .q file with only the following single line:
print_only: CODING*

Coding at more than one node
Sometimes it is useful to combine coding strings that CS generates at more than one node, for example at IP and at NP. It is possible to concatenate the strings into a single string.
This requires a function called concat in the revision query module of CS, which we describe later in this presentation.

The output for our toy example
subj : 2 : subj-fin : np
obj : 3 : subj-fin : np
obj : 2 : fin-subj : np
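Once the coding strings are extracted, each line can be split on the colon into one value per coding column and then tabulated. The sketch below does this in Python for the toy output; the column names in `COLUMNS` are hypothetical labels chosen for this example, since the coding query itself only numbers the columns.

```python
from collections import Counter

# Hypothetical column names; the coding query only numbers the columns 1 to 4.
COLUMNS = ("first", "verb_pos", "inversion", "subject")

OUTPUT = """subj : 2 : subj-fin : np
obj : 3 : subj-fin : np
obj : 2 : fin-subj : np"""

def parse_coding(line):
    """Split one coding string into a dict keyed by column name."""
    values = [v.strip() for v in line.split(":")]
    return dict(zip(COLUMNS, values))

rows = [parse_coding(line) for line in OUTPUT.splitlines() if line.strip()]

# Cross-tabulate first constituent against subject-verb order.
crosstab = Counter((r["first"], r["inversion"]) for r in rows)

for (first, inversion), n in sorted(crosstab.items()):
    print(first, inversion, n, sep="\t")
```

The same rows could be written out as delimited text for import into a statistics package; the cross-tabulation here is only meant to show the shape of the data.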

Some more realistic output
subj-fin:np
obj:3:subj-fin:pro
temp:3:subj-fin:np

Importing coding strings into an R dataframe

A case study: the rise of recipient passives in English (Bacovcin 2012)

Theme passives and recipient passives in Modern English
(1) *John gave the books to Mary.
(2) *The books were given to Mary (by John).
(3) *John gave Mary the books.
(4) *Mary was given the books (by John).
(5) *The books were given Mary (by John).

Theme passives and recipient passives in Modern German
(1) *Hans gab der Maria den Artikel.
              DAT       ACC
(2) *Der Artikel wurde der Mary (von Hans) gegeben.
(3) *Die Maria wurde den Artikel (von Hans) gegeben.
(4) *Der Maria wurde der Artikel (von Hans) gegeben.

Ditransitive sentences in Early Middle English
(1) *John gave Mary the book.
(2) *John gave the book to Mary.
(3) *John gave to Mary the book.
(4) *John gave the book Mary.

Theme passives and recipient passives in Early Middle English
(1) *The books were given to Mary (by John).
(2) *The books were given Mary (by John).
(3) *Mary was given the book (by John).

German double accusatives
(1) *Hans hat die Kinder Geschichte gelehrt.
              ACC        ACC
(2) ?Hans hat den Kindern Geschichte gelehrt.
              DAT         ACC
(3) *Geschichte wurde die Kinder gelehrt.
(4) *Geschichte wurde den Kindern gelehrt.
(5) *Die Kinder wurden Geschichte gelehrt.

Rise in the use of prepositional indirect objects in English
[Figure: % PP by Year, by Type (ACC DAT, DAT ACC); panels: Noun Indirect Obj, Pronoun Indirect Obj]

Rise in recipient passives in English
[Figure: y-axis % Recipient]

Markov Chain Monte Carlo simulations of the change
[Figure: y-axis Rate of change per year; legends: Model (smoothed data GAM, Model 1, Model 2, Model 3, Model 4), Context]

Revision queries

Revision query 0.0: Concatenating coding strings
copy_corpus: t
query: (NP* iDoms CODING-NP*)
AND (CODING-NP* iDoms [1]{2}.*)
AND (NP* iDoms CP-REL*)
AND (CP-REL* iDoms CODING-CP-REL*)
AND (CODING-CP-REL* iDoms [2]{1}.*)
concat{2, 1}:

English annotation for modals: Monoclausal structure
( (IP-MAT (NP-SBJ (PRO They))
          (MD will)
          (VB come)
          (ADVP-TMP (ADVR later))))

Romance annotation for modals: Biclausal structure
( (IP-MAT (NP-SBJ (PRO They))
          (MD will)
          (IP-INF (VB come)
                  (ADVP-TMP (ADVR later)))))

Revision query 1.0: From monoclausal to biclausal structure
node: ROOT
query: (IP-* iDoms MD)
AND (IP-* iDoms [1]{1}.*)
AND (MD iPrecedes [1].*)
AND (IP-* iDomsLast [2]{2}.*)
add_internal_node{1,2}: IP-INF

But what about punctuation?
( (IP-MAT (NP-SBJ (PRO They))
          (MD will)
          (VB come)
          (ADVP-TMP (ADVR later))
          (PUNC .)))

Revision query 1.1: Ignoring punctuation
node: ROOT
ignore_nodes: PUNC
query: (IP-* iDoms MD)
AND (IP-* iDoms [1]{1}.*)
AND (MD iPrecedes [1].*)
AND (IP-* iDomsLast [2]{2}.*)
add_internal_node{1,2}: IP-INF

Revision query 2: From biclausal to monoclausal structure
node: ROOT
query: (IP-* iDoms MD)
AND (IP-* iDoms {1}IP-INF)
AND (MD iPrecedes IP-INF)
delete_node{1}:
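The effect of revision queries 1.1 and 2 can be sketched on trees held as nested Python lists. This is only an emulation of what the two structural changes do, under the assumption that every daughter of IP is itself a list `[label, ...]`; the function names `to_biclausal`, `to_monoclausal`, and `show` are hypothetical and do not correspond to CorpusSearch internals.

```python
def show(node):
    """Render a nested-list tree as a labeled bracketing."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(show(child) for child in node) + ")"

def to_biclausal(ip, ignore=("PUNC",)):
    """Wrap the daughters after MD, up to the last non-ignored one, in IP-INF."""
    kids = ip[1:]
    md = next((i for i, k in enumerate(kids) if k[0] == "MD"), None)
    visible = [i for i, k in enumerate(kids) if k[0] not in ignore]
    if md is None or not visible or visible[-1] <= md:
        return ip                                   # nothing to wrap
    start = next(i for i in visible if i > md)      # first visible node after MD
    end = visible[-1]                               # last visible daughter
    new_node = ["IP-INF"] + kids[start:end + 1]
    return [ip[0]] + kids[:start] + [new_node] + kids[end + 1:]

def to_monoclausal(ip):
    """Delete an IP-INF node that directly follows MD, keeping its daughters."""
    kids = ip[1:]
    out = [ip[0]]
    for i, k in enumerate(kids):
        if k[0] == "IP-INF" and i > 0 and kids[i - 1][0] == "MD":
            out.extend(k[1:])                       # splice daughters in place
        else:
            out.append(k)
    return out

mono = ["IP-MAT",
        ["NP-SBJ", ["PRO", "They"]],
        ["MD", "will"],
        ["VB", "come"],
        ["ADVP-TMP", ["ADVR", "later"]],
        ["PUNC", "."]]

bi = to_biclausal(mono)
print(show(bi))
# (IP-MAT (NP-SBJ (PRO They)) (MD will) (IP-INF (VB come) (ADVP-TMP (ADVR later))) (PUNC .))
print(show(to_monoclausal(bi)))
# (IP-MAT (NP-SBJ (PRO They)) (MD will) (VB come) (ADVP-TMP (ADVR later)) (PUNC .))
```

Because PUNC is ignored when locating the last daughter, the final period stays outside the new IP-INF, which is the point of the revision from query 1.0 to query 1.1.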

ECM annotation
( (IP-MAT (NP-SBJ (PRO They))
          (VBD saw)
          (IP-INF (NP-SBJ (PRO him))
                  (VB arrive))))

Accusativus cum infinitivo annotation
( (IP-MAT (NP-SBJ (PRO They))
          (VBD saw)
          (NP-OB1 (PRO him))
          (IP-INF (VB arrive))))

Revision query 3.0: From ECM to A.c.I.
node: ROOT
query: (IP-* iDoms IP-INF)
AND (IP-INF iDoms {1}NP-SBJ)
move_up_node{1}:
replace_label{1}: NP-OB1

Revision query 4.1: From A.c.I. to ECM
node: ROOT
query: (IP-* iDoms {1}NP-OB1)
AND (IP-* iDoms {2}IP-INF)
AND (NP-OB1 iPrecedes IP-INF)
move_to{1,2}:
replace_label{1}: NP-SBJ

But we don’t want to revise cases of object control
( (IP-MAT (NP-SBJ (PRO They))
          (VBD persuaded)
          (NP-OB1 (PRO him))
          (IP-INF (TO to)
                  (VB come))))

Revision query 4.2: Restricting the revision to matrix “saw”
node: ROOT
query: (IP-* iDoms {1}NP-OB1)
AND (IP-* iDoms V*)
AND (V* iDoms saw)
AND (IP-* iDoms {2}IP-INF)
AND (NP-OB1 iPrecedes IP-INF)
move_to{1,2}:
replace_label{1}: NP-SBJ

Revision query 4.3: Using iDomsMod
node: ROOT
query: (IP-* iDoms {1}NP-OB1)
AND (IP-* iDomsMod V* saw)
AND (IP-* iDoms {2}IP-INF)
AND (NP-OB1 iPrecedes IP-INF)
move_to{1,2}:
replace_label{1}: NP-SBJ

A trivial definitions file
see: see* saw

Revision query 4.4: Using the trivial definitions file
node: ROOT
define: trivial.def
query: (IP-* iDoms {1}NP-OB1)
AND (IP-* iDomsMod V* see)
AND (IP-* iDoms {2}IP-INF)
AND (NP-OB1 iPrecedes IP-INF)
move_to{1,2}:
replace_label{1}: NP-SBJ

A less trivial definitions file
feel: feel* felt
hear: hear*
let: let*
see: see* saw
ECM-verb: feel hear let see

Revision query 4.5: Using the less trivial definitions file
node: ROOT
define: less-trivial.def
query: (IP-* iDoms {1}NP-OB1)
AND (IP-* iDomsMod V* ECM-verb)
AND (IP-* iDoms {2}IP-INF)
AND (NP-OB1 iPrecedes IP-INF)
move_to{1,2}:
replace_label{1}: NP-SBJ
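The restriction to ECM verbs can be illustrated with a small Python emulation of the A.c.I.-to-ECM revision. It is a sketch under stated assumptions, not CorpusSearch behavior: the wildcard patterns from the definitions file are matched with Python's `fnmatch`, the moved object is assumed to become the first daughter of IP-INF (as in the ECM annotation slide), and the names `ECM_PATTERNS`, `is_ecm_verb`, and `aci_to_ecm` are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the less trivial definitions file (feel, hear, let, see).
ECM_PATTERNS = ["feel*", "felt", "hear*", "let*", "see*", "saw"]

def is_ecm_verb(word):
    """True if the word form matches one of the ECM verb patterns."""
    return any(fnmatchcase(word.lower(), p) for p in ECM_PATTERNS)

def aci_to_ecm(ip):
    """Move NP-OB1 into a following IP-INF as NP-SBJ, for ECM matrix verbs only."""
    kids = ip[1:]
    has_ecm = any(k[0].startswith("V") and isinstance(k[1], str) and is_ecm_verb(k[1])
                  for k in kids)
    if not has_ecm:
        return ip                                   # e.g. object control: leave as is
    for i in range(len(kids) - 1):
        if kids[i][0] == "NP-OB1" and kids[i + 1][0] == "IP-INF":
            subject = ["NP-SBJ"] + kids[i][1:]      # relabel the moved node
            inf = ["IP-INF", subject] + kids[i + 1][1:]
            return [ip[0]] + kids[:i] + [inf] + kids[i + 2:]
    return ip

saw = ["IP-MAT",
       ["NP-SBJ", ["PRO", "They"]],
       ["VBD", "saw"],
       ["NP-OB1", ["PRO", "him"]],
       ["IP-INF", ["VB", "arrive"]]]

persuaded = ["IP-MAT",
             ["NP-SBJ", ["PRO", "They"]],
             ["VBD", "persuaded"],
             ["NP-OB1", ["PRO", "him"]],
             ["IP-INF", ["TO", "to"], ["VB", "come"]]]

revised = aci_to_ecm(saw)
untouched = aci_to_ecm(persuaded)
print(revised)
print(untouched == persuaded)
```

One thing the sketch makes visible: a wildcard such as see* matches any form beginning with those letters, so under this kind of matching a form like seems would also count. Whether that matters depends on how the definitions are written for a given corpus.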

End
