Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
been contributed by various people using NLTK for sentence boundary detection.
For information about how to use these models, please consult the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
There are pretrained tokenizers for the following languages:
File                Language      Source                              Contents                        Size of training       Model contributed by
                                                                                                      corpus (in tokens)
================================================================================================================================================
czech.pickle        Czech         Multilingual Corpus 1 (ECI)         Lidove Noviny                   ~345,000               Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------
danish.pickle       Danish        Avisdata CD-Rom Ver. 1.1. 1995      Berlingske Tidende,             ~550,000               Jan Strunk / Tibor Kiss
                                  (Berlingske Avisdata, Copenhagen)   Weekend Avisen
------------------------------------------------------------------------------------------------------------------------------------------------
dutch.pickle        Dutch         Multilingual Corpus 1 (ECI)         De Limburger                    ~340,000               Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------
english.pickle      English       Penn Treebank (LDC)                 Wall Street Journal             ~469,000               Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------
estonian.pickle     Estonian      University of Tartu, Estonia        Eesti Ekspress                  ~359,000               Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------
finnish.pickle      Finnish       Finnish Parole Corpus, Finnish      Books and major national        ~364,000               Jan Strunk / Tibor Kiss
                                  Text Bank (Suomen Kielen            newspapers
                                  Tekstipankki), Finnish Center
                                  for IT Science (CSC)
------------------------------------------------------------------------------------------------------------------------------------------------
french.pickle       French        Multilingual Corpus 1 (ECI)         Le Monde                        ~370,000               Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------
german.pickle       German        Neue Zürcher Zeitung AG             Neue Zürcher Zeitung            ~847,000               Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------
greek.pickle        Greek         Efstathios Stamatatos               To Vima (TO BHMA)               ~227,000               Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------
italian.pickle      Italian       Multilingual Corpus 1 (ECI)         La Stampa, Il Mattino           ~312,000               Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------
norwegian.pickle    Norwegian     Centre for Humanities               Bergens Tidende                 ~479,000               Jan Strunk / Tibor Kiss
                    (Bokmål and   Information Technologies,
                    Nynorsk)      Bergen
------------------------------------------------------------------------------------------------------------------------------------------------
polish.pickle       Polish        Polish National Corpus              Literature, newspapers, etc.    ~1,000,000             Krzysztof Langner
------------------------------------------------------------------------------------------------------------------------------------------------
portuguese.pickle   Portuguese    CETENFolha Corpus                   Folha de São Paulo              ~321,000               Jan Strunk / Tibor Kiss
                    (Brazilian)   (Linguateca)
------------------------------------------------------------------------------------------------------------------------------------------------
slovene.pickle      Slovene       TRACTOR,                            Delo                            ~354,000               Jan Strunk / Tibor Kiss
                                  Slovene Academy for Arts
                                  and Sciences
------------------------------------------------------------------------------------------------------------------------------------------------
spanish.pickle      Spanish       Multilingual Corpus 1 (ECI)         Sur                             ~353,000               Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------
swedish.pickle      Swedish       Multilingual Corpus 1 (ECI)         Dagens Nyheter                  ~339,000               Jan Strunk / Tibor Kiss
                                                                      (and some other texts)
------------------------------------------------------------------------------------------------------------------------------------------------
turkish.pickle      Turkish       METU Turkish Corpus                 Milliyet                        ~333,000               Jan Strunk / Tibor Kiss
                                  (Türkçe Derlem Projesi)
------------------------------------------------------------------------------------------------------------------------------------------------
The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
Unicode using the codecs module.
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.
---- Training Code ----

# Import punkt and the modules needed for reading and pickling
import codecs
import pickle

import nltk.tokenize.punkt

# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in training corpus (one example: Slovene)
text = codecs.open("slovene.plain", "r", "iso-8859-2").read()

# Train tokenizer on the raw corpus
tokenizer.train(text)

# Dump pickled tokenizer
out = open("slovene.pickle", "wb")
pickle.dump(tokenizer, out)
out.close()
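The same workflow can be sketched end to end without any corpus files (assumption: NLTK is installed; the tiny inline corpus is purely illustrative and far smaller than the ~400,000-token corpora used for the real models):

```python
# Train a Punkt tokenizer on a small inline text, round-trip it through
# pickle (standing in for the slovene.pickle file above), and apply it.
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer

training_text = (
    "Punkt learns abbreviations from raw text. "
    "It needs no hand-labelled sentence boundaries. "
    "Real models are trained on hundreds of thousands of tokens."
)

tokenizer = PunktSentenceTokenizer()
tokenizer.train(training_text)

# pickle.dumps/loads replaces the file-based dump/load round trip.
restored = pickle.loads(pickle.dumps(tokenizer))

sentences = restored.tokenize("First sentence. Second sentence.")
print(sentences)
```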