GUEST SPEAKER: Ian Milligan, University of Waterloo (Understanding the needs of researchers who want to use web archives as a data set)
What are your thoughts on how researchers are using/want to use web archives? How has this informed Warcbase development?
What was your process for determining the best way to access, store and interrogate the data set when you were working with Archive-It?
Is there a training component currently in place at Waterloo for faculty and students who want to start using Warcbase? If so, where is this training/instruction provided - through the library?
What recommendations do you have for people just starting web-archiving programs? More specifically, what parameters/workflows/specifications should new web-archiving programs set in order to optimize usability and access for future researchers?
Round Robin:
Upcoming conference presentations?
Article/publication CfPs of interest?
Policy development
Repository progress
Campus trends/initiatives and how those may be impacting repository activities
Discussion items
Time
Item
Who
Notes
Round Robin
Abby:
Porter Olsen is coming to UT from UofM to do research using Gabriel García Márquez materials; this will help with developing access policies for born-digital archival material based on a real use case
Laura:
Studies the preservation of Islamic cinema from the 1970s and 1980s - even though there is a state-run means of preserving film archives, the country restricts what foreigners can access, so there has not been much oversight of the actual preservation strategies or efforts being employed there
Under officials in the new regime, post-1979, a lot of material was destroyed or damaged
Films are in the hands of the official archives and also private collectors who maintain and reformat them; those are now popping up online on websites that are not strictly accessible in Iran
Shannon:
They don't have a lot of digital material but are working on collection development in that area
Ashley:
Just received a large collection of disks from an anthropologist, mostly from the 1990s - the transcripts of the oral histories are on those floppy disks and a researcher is coming to use them; working on an access solution
The primary preservation storage approach is vaulting things to tape, but the tape drive has failed and is being updated
Marianna:
Joined the data management committee - focused on institutional data
Documentum for faculty and staff
Enforces records management laws
Might go live next summer - August of 2017
Academic data is not included because of IP for students and faculty
Iraq invasion
So much data but not a lot of access; not clear on how to use it
Iraqi issue and Kurdistan issue
They have a lot of artifacts about the
Ian Milligan
We need to
Web archive tool and development
Warcbase is a web archiving platform - it speeds up access to Wayback Machines; it's a way to analyze web archives
works really well on a Raspberry Pi, personal laptop or in a cluster
in a nutshell - it takes a WARC file and allows you to:
look for link structure
topic modeling
setting up the data
network graph
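The network graph item above can be illustrated with a minimal sketch. This is not Warcbase's API (Warcbase itself is not shown here); it uses only the Python standard library, and the function names and sample pages are hypothetical. It assumes the (URL, HTML) pairs have already been read out of the WARC records.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_link_graph(pages):
    """Count links between domains.

    `pages` is an iterable of (url, html) pairs; in practice these would be
    read out of WARC response records, which this sketch does not do.
    Returns a Counter mapping (source_domain, target_domain) -> link count.
    """
    edges = Counter()
    for url, html in pages:
        collector = LinkCollector()
        collector.feed(html)
        source = urlparse(url).netloc
        for href in collector.hrefs:
            # Resolve relative links against the page URL before comparing.
            target = urlparse(urljoin(url, href)).netloc
            if target and target != source:
                edges[(source, target)] += 1
    return edges


# Hypothetical sample data standing in for archived pages.
SAMPLE_PAGES = [
    ("http://party-a.example/index.html",
     '<a href="http://party-b.example/">B</a> <a href="/about">About</a>'),
    ("http://party-b.example/",
     '<a href="http://party-a.example/news">A</a>'
     '<a href="http://news.example/story">News</a>'),
]

graph = domain_link_graph(SAMPLE_PAGES)
for (source, target), count in sorted(graph.items()):
    print(source, "->", target, count)
```

Links within the same domain are skipped, so the result is a domain-to-domain edge list that could be loaded into a graph tool.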
They started developing Warcbase 3 years ago - with Jimmy Lin in computer science
Research use cases:
IIPC - the focus up until the last few years has been on grabbing everything, without much thought about access
there is a survey coming out and next week at IIPC they will present stats on why people are or are not using archives
Current researchers in LUNA, UK Archive people,
Michigan just ran a great conference on web archiving research
Basic sense of what researchers want:
you need more than the Wayback Machine - you have to know the URL, and you can only browse one page at a time
we need to scale up
while we need search, the search needs to be intelligible
Keyword query with no prioritization of results
The goal of Warcbase is to make things transparent because if
Collaboration between computer scientists and historians - it is working well because the team is half CS and half historians
Historians will say, "I would really like the power to do X" - they create a ticket, CS responds to the tickets and they respond - research questions actually guide the development
the importance of doing things open source
hackathon last month - about building a community that gains enough momentum
A lot of initial development required the researcher to dig into the tool - what they are trying to do is use Jupyter notebooks (ideally they have someone to spin up the notebooks - Vagrant), point the web browser at it and then
goal of development - people use Jupyter to prototype what they want to do and then paste that code into the shell
publishing on it - working in an interdisciplinary group (librarians, historians, and CS) - the key to success - everybody wants to publish in different scholarly communities
technical stuff is ending up
librarian presented at Code4Lib and in their journal
arts and humanities computing
published an early web archiving piece in
trying to put something together for Digital Humanities Quarterly
he is working on what it was like being a kid building a website in the 1990s
everybody has to be happy - Jimmy needs a reason to collaborate, Ian does and the librarian does - tenure
training - they are running a pilot training workshop in Iceland; if that works they want to make that universal
one of the goals this summer, RAs that don't have
Software Carpentry workshop - Python, GitHub, etc.
relationship to the library:
will it continue to be run like an open source community or be hosted in the library - he prefers a consortial model for providing resources to support
not having faculty status for librarians at Waterloo means they don't have the bandwidth for research
what can we do to improve:
the most important thing is documentation - if you are setting up a collection, you need to have your seed list written down, along with why you decided to collect what you decided to collect
the Canadian political parties collection was set up in 2005 by a librarian and there was no documentation, so if I publish, a peer reviewer will immediately ask why these sites and not those
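One way to keep the seed list and the collecting rationale together is a small structured record per seed. The sketch below is only an illustration: the `SeedEntry` fields, the helper name and every sample value are hypothetical, not a format used by Archive-It or any other tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SeedEntry:
    """One seed URL plus the reasoning behind collecting it."""
    url: str
    added: str       # ISO date the seed was added
    added_by: str
    rationale: str   # why this site is in scope


def undocumented(seeds):
    """Return the URLs of seeds whose rationale was left blank."""
    return [seed.url for seed in seeds if not seed.rationale.strip()]


# Hypothetical entries; the URLs, dates and names are placeholders.
SEEDS = [
    SeedEntry("http://party-a.example/", "2005-10-01", "curator-1",
              "Registered federal party; official site."),
    SeedEntry("http://party-b.example/", "2005-10-01", "curator-1", ""),
]

# Serialize the seed list so it can be stored alongside the collection.
seed_json = json.dumps([asdict(seed) for seed in SEEDS], indent=2)
print(seed_json)
print("Missing rationale:", undocumented(SEEDS))
```

A check like `undocumented` could be run whenever seeds are added, so that the "why these sites and not those" question has a written answer later.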
beyond documentation, there is the debate between running Heritrix yourself or using Archive-It - Archive-It has a great community, but Nick has been running Heritrix by himself and that allows him to run it on an experimental basis
advertising is good - University of Toronto - you need to advocate and do outreach: we have this stuff, and if you want to use it, point them to the Warcbase workshop
social media is super important - the debate is whether you should use Archive-It or the Twitter API to download the data - intersection between
this summer, they used the SHINE program that the UK launched; it just provided faceted search - Columbia human rights also uses a faceted search engine
Wayback Machine - with simple Archive-It keyword search it is overwhelming; if you have good metadata you can start faceting down (dates, languages, subjects) - auto-extracted metadata
they found that was cool and they got a lot of news coverage - people then go, this is cool but it raises the question of something more sophisticated - the gateway to Warcbase
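The idea of faceting down on metadata can be sketched in a few lines. This is not how SHINE or any particular search engine is implemented; the records, field names and helper functions are hypothetical and stand in for auto-extracted metadata.

```python
from collections import Counter

# Hypothetical metadata records standing in for auto-extracted fields.
RECORDS = [
    {"url": "http://a.example/1", "year": 2005, "language": "en", "subject": "politics"},
    {"url": "http://a.example/2", "year": 2005, "language": "fr", "subject": "politics"},
    {"url": "http://b.example/1", "year": 2006, "language": "en", "subject": "health"},
    {"url": "http://b.example/2", "year": 2006, "language": "en", "subject": "politics"},
]


def facet_counts(records, field):
    """Return how many records fall under each value of one facet field."""
    return Counter(record[field] for record in records)


def facet_down(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in selected.items())
    ]


# Narrow by language first, then inspect the remaining year facet.
english = facet_down(RECORDS, language="en")
print(facet_counts(english, "year"))

# Narrow again by subject.
english_politics = facet_down(english, subject="politics")
print([record["url"] for record in english_politics])
```

Each narrowing step shrinks the result set and the facet counts show what is left to choose from, which is what makes a large keyword result list manageable.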
the issue is that the underlying data for websites is messy; you have to
he works with the sociologists and network scholars and Matt Weber at Rutgers - when they look at web archives, he wants content and they want the graphs
social scientists want to look at networks, entities and things in a non-digital way
historical contribution: longitudinal dimension
CS: what you need now
historians are looking for change over time
in a dreamworld we would love to see other folks using it - so if we use it, we should be in contact with the team; open source stack
if we decide to
the fun of Archive-It and do funky stuff with it
how researchers get access across institutional collections - they have been exploring UT legal team - what kind of MOU do we need to ask for WARC files and
U of Victoria and Alberta have been totally open to it - with the caveat of citing the source and not sharing the WARC files; when they create public-facing things, they allow the libraries to review them
when they finally get to the document, it will be cited as University of Toronto; he shared a sample MOU document (between donating university and the PI on the grant) - same with broader consortial project - agreement between Ian and the University of Alberta