Preserving web pages is a common investigative task, and choosing how to do it is one of the most important decisions involved.
In this post we’ll look at two methods, screenshots and web archives, and test-drive tools used for each.
Web pages are generally easy to admit into court: The witness testifies that they viewed a web page and offers a screenshot as a replica of what they saw. This works in many cases, but it’s important to know when to use this evidence and why.
There are multiple tools for capturing pages and different reasons for using each.
Not all web pages are the same
To be admissible, screenshots need to satisfy the rules of evidence. In the federal courts, this means the screenshot needs to be authentic and relevant to the case. Prosecutions unravel when one of these elements is not met, and the element most easily attacked is authentication. So let’s consider what authentication means for a web page and when a screenshot may lack it.
There’s more to a web page than what you see in the browser or, for that matter, a screenshot. Browsers often make requests to multiple locations to render the view. Images, videos and scripts can be hosted by different parties and on different servers. And, depending on the website, this view can change from viewer to viewer or from moment to moment. Content may depend on the viewer’s geographic location, browser cookie data, social network connections, and keywords within the page. Intentionally or not, the browser view and the HTML that renders it can be altered.
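To make the "many locations" point concrete, here is a minimal sketch that lists the hosts a saved page refers to for its images, scripts, stylesheets and media. It uses only Python’s standard library. The names ResourceCollector and resource_hosts are made up for this illustration, and the sketch only reads static HTML: anything a script requests at run time will not appear.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collects URLs of resources a browser would fetch to render a page.

    Hypothetical helper for illustration; it only sees static HTML attributes.
    """

    # Tag -> attribute that points at a fetched resource.
    RESOURCE_ATTRS = {
        "img": "src", "script": "src", "iframe": "src",
        "video": "src", "audio": "src", "source": "src",
        "embed": "src", "link": "href",
    }

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative and protocol-relative references.
                self.urls.append(urljoin(self.base_url, value))


def resource_hosts(html, base_url):
    """Group the page's resource URLs by the host that serves them."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    hosts = {}
    for url in collector.urls:
        hosts.setdefault(urlparse(url).netloc, []).append(url)
    return hosts
```

Calling resource_hosts(saved_html, page_url) on a typical news or social page will usually return several hosts besides the one in the address bar, which is a quick way to see how much of the view comes from third parties.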
Investigators need to be specific about what they are preserving and consider how it is being rendered. Screenshots may fall short when that focus is multimedia or when the content is dynamically changing.
The date and time a page was viewed are always relevant when producing web evidence. Presenting this key data together with the full page, or an accurate representation of it, is just as important.
Investigators aren’t always browsing for the purpose of taking screenshots, so they should have a plan for when opportunities arise, such as a new social media post or a change to a page that may be temporary. There are different tools for different tasks. For screenshots, Fireshot is one built for evidence collection.
Fireshot is a freemium tool; the $40 licensing fee delivers useful features to customize the watermarks. Fireshot captures full pages or browser windows in PDF and image formats. The watermarking options cover many of the authentication essentials: the date, time, URL, page title and, if you choose, the person who captured it.
Browser window captures are helpful when the focus is a specific element on the page. For example, you may want to hover over a hyperlink to highlight its destination or display an HTML publisher code next to the advertising banner. Browser inspectors like Firebug highlight this data and they can log the IP addresses and the HTTP requests behind the scenes. In these instances, the browser capture adds relevance to the bigger picture.
Firefox offers several add-ons that can also document the date, time and location of a page for browser capture or desktop recording.
Displays IP address of the page being viewed, as well as the user if relevant.
Can be customized to show the viewer’s local date and time, as well as the city location.
Display Window Title
Displays the page title found in the HTML on the browser tab.
Make Address Bar Font Size Bigger
Ensures the URL in the address bar is readable.
Firebug & NetExport
Inspect HTML. Log and export HTTP requests.
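A request log exported from a browser inspector can also be reviewed outside the browser. The sketch below assumes the export is in the HAR (HTTP Archive) format, which is JSON, and pulls out the time, URL, status and server IP address of each request. The function name summarize_har is made up for this illustration, and the serverIPAddress field is optional in HAR, so it may be blank in a given export.

```python
import json


def summarize_har(har_text):
    """Return one row per HTTP request recorded in a HAR export.

    Assumes the standard HAR layout: log -> entries -> request/response.
    """
    har = json.loads(har_text)
    rows = []
    for entry in har["log"]["entries"]:
        rows.append({
            "time": entry.get("startedDateTime", ""),
            "url": entry["request"]["url"],
            "status": entry["response"]["status"],
            # Optional field; not every exporter records it.
            "server_ip": entry.get("serverIPAddress", ""),
        })
    return rows
```

A summary like this pairs well with a browser capture: the image shows what was on screen, and the rows show where each piece came from and when.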
Fireshot is easy to use and access, but it does have some downsides. It can run into memory issues on large pages, where it simply doesn’t work. It’s also important to remember that the browser is just a tool and that the view it renders can be affected by many factors. Even the add-ons can affect the view or be used to edit the text, images and video embeds.
A web page is made of many separate pieces. Screenshots fall short when the focus is specific to a piece or when an image cannot corroborate the pieces as a whole.
Screenshots are great for supporting documents and general observations. But, they’re not working copies. You cannot navigate the page, examine links, verify or compare the content against the original resources. This is because the images are flat and web pages are dynamic. This is especially important with complex websites or significant changes to a web page where the evidence can be enhanced with a working archive.
There are several tools that archive websites and pages. The way these tools work makes a difference.
HTTrack is a crawler that makes backup copies of websites for offline viewing. The copies are made by harvesting HTML files and scripts which are packaged into a local folder.
A backup retains much of the look and feel of the original website. This makes it easy to conduct analysis later and it ensures a copy is retained after the investigation is revealed. The problem with backups is that some of the copied pages will always rely on externally hosted scripts, images and video links from the HTML. Because of this, the offline page may not be identical to the web page when it was saved. More importantly, the backup is not a true preservation because it relies on these links and an Internet connection, which alters the evidence. As such, backups are often viewed as supplemental evidence.
So, what is the proper way to preserve these many pieces? Many experts and some recent court cases point to a web archive format known as WARC. WARC is a revision of the format used by Archive.org to preserve the web and it’s seeing some increased use in evidence collection.
WARC is both a log and a recording. It bundles all of the details from a web page request into a single file. This data breaks down each request for a script, image and video. It documents the date and time of each request, the IP address for the server hosting each object, and the metadata or details describing the status of the file. These details include the date and time the file was uploaded or last modified, the size of the file and any unique tags attached to it. The WARC can be created with add-ons like Warcreate and played back through an application or an online service like Webrecorder.io. The raw file can also be viewed with a browser or Notepad++.
The problem with WARC is the format itself. For most users, it’s not as easy to capture or play back as a screenshot. Warcreate is the only browser add-on available for capturing these files, but in our tests it also preserved other pages from our browsing history. And, like the Wayback Machine, WARC files are unable to recreate some widgets and media like advertising, images and videos. This is especially important when that object is essential to your evidence.
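Because a WARC is plain text headers followed by the recorded content, its records can be listed with a few lines of code. The sketch below is a minimal reader, assuming an uncompressed WARC held in memory (many archives are stored gzip-compressed and would need to be decompressed first). The function name read_warc_records is made up for this illustration. It relies only on the basic record layout: a version line, named header fields, a blank line, then a block whose size is given by Content-Length.

```python
def read_warc_records(data):
    """Yield (headers, body) for each record in uncompressed WARC bytes.

    Minimal sketch for illustration; a dedicated WARC library is a better
    choice for real casework.
    """
    pos = 0
    while pos < len(data):
        # Records are separated by blank lines; skip them.
        while data.startswith(b"\r\n", pos):
            pos += 2
        if pos >= len(data):
            break
        header_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        if not lines[0].startswith("WARC/"):
            raise ValueError("not a WARC record: %r" % lines[0])
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        # The block length comes from the header, not from scanning the body.
        length = int(headers["Content-Length"])
        body_start = header_end + 4
        yield headers, data[body_start:body_start + length]
        pos = body_start + length
```

Printing the WARC-Type, WARC-Target-URI and WARC-Date of each record gives a quick inventory of what was requested and when, without needing a playback tool.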
That’s where yet another archive format comes in handy: MHT. MHT fills the gap between WARC files and backups by bundling text and graphics together in a single file. It’s actually an old Internet Explorer format and it can deliver the look and feel of the web page without an Internet connection. MHT works especially well with social profiles, individual posts or web pages. In fact, a complete profile can be saved by opening and saving each page together through the multiple tab function. For platforms like Facebook and Twitter, users will need to load the entire page before saving by scrolling down to load any cached items. The Firefox MHT add-on places an icon next to your Fireshot button for quick access and easy replay on the browser. MHT is not a replacement for WARC, but it is a great supplement on its own.
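An MHT file is a MIME multipart message, the same packaging used for email with attachments, so its contents can be inventoried with Python’s standard email module. The sketch below lists each bundled part with its original location, type and size. The function name list_mht_parts is made up for this illustration, and it assumes the file follows the usual layout in which each part carries a Content-Location header.

```python
from email import message_from_bytes


def list_mht_parts(mht_bytes):
    """Return location, content type and decoded size of each part in an MHT file."""
    message = message_from_bytes(mht_bytes)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # container, not an actual resource
        payload = part.get_payload(decode=True) or b""
        parts.append({
            # Original URL of the resource, if the file recorded one.
            "location": str(part.get("Content-Location", "")).strip(),
            "type": part.get_content_type(),
            "size": len(payload),
        })
    return parts
```

Comparing this inventory with the images and media you expected to capture is a simple check that the saved file really contains the pieces that matter to your evidence.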
Web preservation services
Chances are you’ll capture some evidence in more than one format and there is a way to do this in a single click. Web preservation services automatically capture screenshots, HTML and WARC files together. Most are subscription services like Webpreserver.com and they strengthen the package by also providing hash values and certificates. Web preservation services are great for high-profile and complex investigations and they’re usually retained by law firms. Webpreserver is currently available to the public free in Beta and it’s well worth the visit.
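Hash values are also easy to produce for your own captures. The sketch below computes a SHA-256 value for each file in a capture and collects them in a simple manifest. The function names and the choice of SHA-256 are assumptions made for this illustration, not a description of how any particular preservation service works.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large captures do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each captured file to its SHA-256 value."""
    return {str(path): sha256_of_file(path) for path in paths}
```

Recording the manifest at the time of capture lets anyone re-hash the files later and confirm they have not changed since.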
There are many options for web preservation. The decision depends on the focus and the circumstances at hand. The chart below summarizes some of the pros and cons for the different formats discussed above.
|Method||Type||Native format||Date Time||Operator||File Headers||Meta Data||Ads / Videos||Resources||IP Addresses||User|
• Ads are often published dynamically, based on viewer’s location or subject to real-time bidding
There are best practices for screenshots and archives. You can read more about them through resources like the ones listed below.