Million Books Workshop
Friday, 14 March 2008
Imperial College Internet Centre
Workshop Theme and Topics
The Million Books workshop will explore issues raised by the mass digitisation of our libraries. We will focus on the general problems and opportunities of very large collections of books, digitised only as page images and with text automatically generated by Optical Character Recognition (OCR) software. This discussion will build on earlier Million Books meetings in May and November 2007 at Tufts University and the Council on Library and Information Resources (CLIR) in Washington, DC. The London workshop and a subsequent workshop in Berlin will conclude a study on this general topic. The report from the November CLIR.org meeting will provide the starting point for discussion.
The focus of this workshop will be classical antiquity, especially as reflected in its major languages—Latin and Ancient Greek—and their long afterlife in the archives of European culture. The large quantity of European cultural heritage data referencing classical antiquity that will become available as a result of mass digitisation poses an immense challenge for digital collections on every level from knowledge management to systems engineering. The workshop seeks to bring experts in content, user access, technologies and policy together to address these fundamental issues in a holistic way.
We will examine five basic questions.
- What are the uses of very large collections that were not feasible before? Specifically, what new research questions can be asked? What cross-domain uses are possible?
- What services will we need to support these new uses?
- What kinds of collections will we need to develop and to maintain?
- What systems will support these services?
- How will we realise these services in the real world?
We will begin with an overview of the current technical state of the art for large corpora, moderated by experts in core technology. In the afternoon session, a roundtable discussion will identify and explore the challenges and opportunities created by mass digitisation, both for Latin and Ancient Greek and for the many disciplines and domains in which they play a role. We will conclude with a general discussion about the way forward.
Programme
9.00: Coffee
9.30: Welcome: John Darlington, Director, Imperial College Internet Centre
9.40: Introduction and Overview: Gregory Crane, Perseus Project
10.00: SESSION I: Services
Dr. Thomas Breuel, DFKI and Technical University Kaiserslautern – From Image to Text: OCR and Mass Digitisation
10.45: Coffee
David Smith, Johns Hopkins University, & David Bamman, Perseus Project – From Text to Information: Machine Translation and Syntax Recognition.
David Mimno, U Mass, Amherst – From Information to Learning: Machine Learning and Classification Techniques
1.00: Lunch
2.30: SESSION II: Collections
Roundtable discussion. Moderator: Gregory Crane
4.00: Coffee
4.15: SESSION III: Systems and Infrastructure
Open Forum
6.00: Concluding Summary: Gregory Crane
7.00: Dinner
Registration
Participation in the workshop is by invitation. A report will be made publicly available.
For further information about the Workshop Programme, please contact Brian Fuchs (b.fuchs@imperial.ac.uk).
How to find us.
Background Material
Million Books Chicago Statement
Many More than a Million: Building the digital environment for the age of abundance