Table of Contents .................................................................................................... vii
Preface ..................................................................................................................... 1
WAC3 ..................................................................................................................... 3
Kevin P. SCANNELL, The Crúbadán Project: Corpus building for underresourced
languages .......................................................................................... 5
Sebastian BLOHM, Philipp CIMIANO, A Human Evaluation of Filtering
Functions for Pattern-based Extraction of Arbitrary Relations from the
Web ..................................................................................................................... 17
Emmanuel CARTIER, TextBox, a Written Corpus Tool for Linguistic Analysis ...... 33
William H. FLETCHER, Implementing a BNC-Compare-able Web Corpus ............ 43
Fabrice ISSAC, Yet Another Web Crawler ................................................................ 57
Igor LETURIA, Antton GURRUTXAGA, Iñaki ALEGRIA, Aitzol EZEIZA, CorpEus,
a 'web as corpus' tool designed for the agglutinative nature of Basque ........... 69
Serge SHAROFF, Classifying Web corpora into domain and genre using
automatic feature identification ......................................................................... 83
Anil Kumar SINGH, Jagadeesh GORLA, Identification of Languages and
Encodings in a Multilingual Document ............................................................. 95
CLEANEVAL .......................................................................................................... 109
Daniel BAUER, Judith DEGEN, Xiaoye DENG, Priska HERGER, Jan GASTHAUS,
Eugenie GIESBRECHT, Lina JANSEN, Christin KALINA, Thorben KRÜGER,
Robert MÄRTIN, Martin SCHMIDT, Simon SCHOLLER, Johannes STEGER,
Egon STEMLE, Stefan EVERT, FIASCO: Filtering the Internet by Automatic
Subtree Classification, Osnabrück ..................................................................... 111
Stefan EVERT, StupidOS: A high-precision approach to boilerplate removal ........ 123
Weizheng GAO, Tony ABOU-ASSALEH, GenieKnows Web Page Cleaning
System ................................................................................................................. 135
Christian GIRARDI, Htmcleaner: Extracting the Relevant Text from the Web Pages ..... 141
Katja HOFMANN, Wouter WEERKAMP, Web Corpus Cleaning using Content
and Structure ...................................................................................................... 145
Michal MAREK, Pavel PECINA, Miroslav SPOUSTA, Web Page Cleaning with
Conditional Random Fields ............................................................................... 155
Xabier SARALEGI, Igor LETURIA, Kimatu, a tool for cleaning non-content text
parts from HTML docs ....................................................................................... 163