Saving the data with the Internet Archive

Published / by main_concerned_scientist

Over the last couple of weeks the data saving effort from the scientific community has been revving up. Here is the comprehensive list, updated as things come into my sight.

General protocol for data rescue through the Internet Archive:

https://github.com/datarefuge/workflow/blob/master/README.md

Environmental data:

https://envirodatagov.org/

https://www.datarefuge.org/

Publicly backing-up scientific datasets that can be adversely altered

Published / by Andrei Kucharavy

This guide walks you through the process of creating public, redundant copies of scientific data you feel are at risk of being adversely tampered with by someone with authority in your institution.

Who?

I am a Ph.D. student in computational biology at Paris University 6 and JHU. This article is the result of a collaborative effort between several people knowledgeable in information security, which I have put into writing. Mehdi El Mhamdi of Mamfakinch is one of them; others preferred to remain anonymous.

What?

We tried to cover several common scenarios in which the data could be compromised or would need a public backup:

  • You have access to data that is currently public and is known not to have been compromised or doctored in any way. However, there is a chance it will be revised to better suit a narrative peddled by someone above, and you want to keep it safe and ensure its integrity
    • Use a PGP signature to sign the data hashes, so that any later alteration can be detected
    • Use BitTorrent/VPN to share the actual data with the world
      • Post torrent links on trackers and public forums where people who would want to keep a copy of your data would be
  • You need to transfer high-impact data and documents to journalists and/or investigators anonymously and securely.
  • You need to securely and privately transfer a dataset to your colleague, either for safe-keeping, inspection or in case you cannot replicate the dataset yourself
    • OnionShare

This article is mostly a step-by-step guide for the first scenario.

The last two scenarios are fairly well covered in the press already and are reasonably straightforward. An article describing them in depth has recently been published in the Intercept.

Limitations?

As per the usual disclaimer, no tools or procedures are 100% fool-proof and errors can happen. The suggested procedures are safe to the extent of our knowledge, but proceed at your own risk, and be aware of the legal implications of what you are about to do and whether you would be ready to weather them.

Short scenario descriptions

Public data backup

Scenario:

You have access to a public scientific dataset. You have first-hand knowledge that this dataset has not been adversely edited. You believe that this dataset is at risk of being adversely edited in the future, to influence public opinion or to support a narrative. You would like to create a public back-up of the data. The data currently have a low social impact. Because of that, you are not at risk for sharing it, but few standard whistleblower channels or academic repositories are willing to take it.

Risks:

  • The data is present in a single copy or a few copies that can be easily reached by an authority willing to adversely alter them
  • The original data will be presented as a doctored version itself
  • Unaltered data is likely to be the subject of takedowns by any means available

How the risks are mitigated:

  • Data is replicated as peer-to-peer torrents, which are notoriously difficult to root out, even with full legal authority
  • Data authenticity and integrity are guaranteed by you PGP-signing its hashes, or even Bitcoin-stamping the hash.
  • Data is replicated by all the peers who are convinced of its importance, not only the central nodes that can be easily taken down

What to do:

  • Install gpg
  • Create a keypair
  • Upload your public key to a key server
  • Authenticate its fingerprint to people who might want to check your signatures in the future. Ways of doing it include hosting the key fingerprint on a server you control, putting it in your Twitter bio or, best of all, having others sign your key to join a web of trust
  • Install a BitTorrent client
  • Add the files you would like to back up to a torrent
  • Advertise your torrent (IRC/Twitter/[Academic Torrents](http://academictorrents.com/), …)
  • Clearsign the hash of the torrent and make it public: Twitter/IRC/mail to people who would be likely to replicate it
  • If you want to be particularly secure, time-stamp the torrent hash within the Bitcoin blockchain

Confidential high-impact data transfer

Scenario:

You have access to datasets or documents that are in stark contrast with the existing narrative. They could reveal an ongoing fraud or data editing and carry a high social impact. Because of that, retaliation against you for sharing them is very likely, and your anonymity and safety are of the highest importance.

Risks:

  • The data is not transmitted to the institutions that could have the best use of it
  • Your identity is revealed

How the risks are mitigated:

  • You use a VPN and connect through the Tor network
  • You drop the data into a SecureDrop Box of an institution that would have the authority and resources to make sure the data you’ve provided leads to results

What to do:

  • Find the SecureDrop location of the institution you would like to contact
  • Establish a VPN connection with one of the reputable VPN providers
  • Establish a Tor connection over it
  • Upload the files you have to the SecureDrop Box

Secure transfer to colleague

Scenario:

You have a dataset or documents that are not considered public or that you would not like to render public. However, they are at risk of being altered and you would like your friend or colleague to keep a copy of it.

Risks:

  • The data is not transmitted to your colleagues
  • The data is transferred to someone else
  • Your identity or the fact that you have transferred the data is revealed.

How the risks are mitigated:

  • Use Tor and OnionShare to host a file transfer from your computer through Tor Network
  • Use PGP mail or Signal to communicate the Onion link to the file to your colleague
  • Your colleague can use the link to securely download the files through Tor, without anyone being able to figure out their origin

What to do:

  • Establish a VPN connection with one of the reputable VPN providers
  • Establish a Tor connection over it
  • Start OnionShare and select files you would like to share
  • Send the link that was generated by OnionShare to your colleague by secure means (Signal, Telegraph, PGP mail, …)
  • Once the download is complete the OnionShare link goes down without a trace.

Detailed how-to for public data back-up

A couple of points of vocabulary:

  • PGP stands for Pretty Good Privacy – a general protocol with several implementations
  • GPG stands for GnuPG, which implements the OpenPGP protocol (RFC 4880). Its source code is publicly reviewed and is watched very closely by a lot of independent security experts. It is widely regarded as safe and tamper-resistant.
  • {your text} signals that you need to replace that part of the command with values relevant to your use case, for instance, your e-mail address or your key fingerprint

Setting up the infrastructure and publicly backing up a verified version of a dataset:

Checksum utility installation:

First of all, you will need a tool for calculating hashes on your platform. Linux comes with a built-in terminal command, sha256sum; on macOS the equivalent is shasum -a 256. On Windows, you will need to install one of many available tools; for instance, I use MD5-SHA checksum utility. However, some operating systems insert metadata into files, which is likely to modify the hashes. Because of that, it is a good idea to sign the SHA1 hashes provided by BitTorrent clients instead.
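As an illustration of what such a checksum tool does, here is a minimal sketch in Python (the choice of Python, the function name file_digest and the chunk size are assumptions made for this example, not part of the original procedure). It reads the file in chunks so that large datasets do not have to fit in memory.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`.

    `algorithm` is any name accepted by hashlib.new(), e.g. "sha256" or "sha1".
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Calling file_digest on a dataset file should give the same value as the platform's own checksum command for the same algorithm.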

GnuPG installation

First, if you don’t already have it, you will need to download the GPG binaries from their official website. At the moment of writing of this article the SHA1 hash of the Windows modern installer Gpg4win 2.3.3.exe is 67e13c4f90ff6a70ad57bd31af64a238c9315308. The Mac version, GPG Suite 2017.1b2, has a SHA256 hash of f74fd4788cfa0820933499768fa7dfe1c0b295bbae9f43812dc3590923975de4.

Please make sure that gpg is accessible from the command line. To avoid any confusion between user interfaces, the rest of the tutorial uses only the command line interface of gpg, so the instructions should be fairly similar between different platforms.

Primary key pair generation

> gpg --gen-key
  • for the key type, the default RSA, both for signing and encryption, will do just fine
  • for the key length, 4096 bits is the right length to ensure the best security
  • for the validity duration, set the expiration to 5 years. It will keep an abandoned key from lingering as valid if you ever lose access to it.
  • for the user ID, choose one that is consistent with the persona. Usually, it is of the form “Name Surname <email address>”. While you can add additional email addresses for which the key would be valid, your name is fixed forever. Because of that, if you want to use an anonymous handle, use it instead of your name and surname, but be aware that it will not be possible to change it in the future. There is also no certification that the key actually belongs to the person whose name and email address are indicated, so you will need to authenticate your key; I will talk about it later.
  • for the passphrase: this one is very important. On one hand, the passphrase is the last barrier between someone who has acquired a copy of your private key and their ability to impersonate you. On the other hand, the most frequent reason users lose access to their PGP keys is forgetting the passphrase. This can happen to anyone and has happened to me already. Because of that, make sure to choose as strong and hard to guess a passphrase as possible, but think about ways of remembering it after you haven’t used GPG for a couple of months or years. Needless to say, keyboard Trojans render the passphrase useless as a measure of protection, so make sure the computer you are using is neither infected nor monitored for keystrokes.
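The choices listed above can also be written down once as a parameter file for GnuPG's unattended key generation, which is run as gpg --batch --gen-key {parameter file}. The sketch below only builds the text of such a file; the helper name key_parameters is hypothetical, and it assumes GnuPG 2.1 or later, which still prompts for the passphrase when the file does not contain one.

```python
def key_parameters(name, email, key_length=4096, expire="5y"):
    """Build the text of a GnuPG unattended key generation parameter file.

    Assumption: GnuPG 2.1+ asks for the passphrase interactively because
    no Passphrase line is written here.
    """
    for field in (name, email):
        if "\n" in field or "\r" in field:
            raise ValueError("name and email must be single-line values")
    lines = [
        "Key-Type: RSA",                  # primary key, used for signing
        f"Key-Length: {key_length}",
        "Subkey-Type: RSA",               # sub-key, used for encryption
        f"Subkey-Length: {key_length}",
        f"Name-Real: {name}",
        f"Name-Email: {email}",
        f"Expire-Date: {expire}",         # "5y" means five years
        "%commit",
    ]
    return "\n".join(lines) + "\n"
```

Writing the result to a file keeps a record of exactly which options were used, which can help if the key has to be regenerated later.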

At this point, you have generated your private/public key pair. Type

> gpg --list-keys

You will see something along the lines of

pub 4096R/XXXXXXXX YYYY-MM-DD
uid [ultimate] Your Name <your email address>
sub 4096R/ZZZZZZZZ YYYY-MM-DD

  • YYYY-MM-DD are respectively the year, month and day of the key generation
  • 4096R is the type of the key: 4096-bit RSA in your case
  • XXXXXXXX is the ID of your primary signing public key. That’s the part that will be used to manually authenticate that the key belongs to you. It will be designated in the future as {your key id}
  • ZZZZZZZZ is the ID of the sub-key bound to your primary key that will be used for encryption.

Adding additional mail addresses to your key

If you want to add additional mail addresses to your public key, this can be done by

> gpg --edit-key {your email address}
> adduid (enter your name and {your additional email address} when prompted)
> trust (choose 5, since it's your own key)
> save

Generation of revocation certificate

This is the first thing you need to do after you have generated your key pair. The certificate will be used to inform others that your public key is no longer valid, for instance if your private key is compromised or if you have lost your passphrase and had to generate a new pair. The revocation certificate needs to be kept offline, preferably in a safe place, given that anyone who holds it can invalidate your key and, with it, all of your signatures. To do it, type:

> gpg --output revoke.asc --gen-revoke {your key id/your email address}

revoke.asc is now to be kept offline, somewhere secure.

Publishing your key

The next step is to publish your public key. Since you will be using the key to publicly sign hashes of the dataset whose security you are trying to ensure, people who want to check the signature will need access to your public key, as well as a way to determine that the public key you’ve created and published is actually yours. The first part is quite easy. Public keys can be published on keyservers, where they can be searched and downloaded by others. Since they are public keys, publishing them there will not compromise your security. However, the email addresses inside the keys are sometimes harvested by spammers, so make sure the addresses you leave in the key have solid anti-spam protection. To publish the key, we will use the MIT PGP keyserver:

> gpg --keyserver pgp.mit.edu --send-keys {your key id}

Now your key has been published and you need to provide a way for other people to know that the published key actually belongs to you. Since anyone can enter any name and email address in a key, impersonation is extremely easy and, in general, no public key except your own is to be trusted without external validation.

The most widespread way of certifying ownership of a PGP key is publishing it on an online page that only you control, that is tamper-proof and that is served over a secured connection, i.e. an https:// page. If you own a website that you will associate with the data, publish your key ID there. Otherwise, your Twitter bio is a common place to store a PGP key ID.

A more secure and long-term viable way of assuring others that a given public key actually does belong to you is to personally (over the phone or through an in-person business card) transmit the fingerprint of your key to another person and have them sign your key, then re-publish it online. You can see the fingerprint of your key by typing:

> gpg --edit-key {your key ID/ your e-mail}
> fpr 

I will go into more detail about how to sign others’ keys later in this guide. I would suggest adding your public key fingerprint to your business card, even if you are not planning on backing up anything anytime soon.

Deleting an email from the public key

If you need to delete an email from a key you’ve created before publishing it, you can do it by

> gpg --edit-key {your email address/your key id}

Enter the number of the identity you want to edit. The list is then displayed with a * next to that ID.

> revuid

Answer the questions, enter the passphrase, then save the changes:

> save

Installing BitTorrent client

BitTorrent is a general protocol allowing peer-to-peer communication. Unlike services with a central point of failure (such as a central server or a central account), the files are shared and preserved by anyone who thinks they are worth the space on their drives, and taking those files out of circulation would require taking out every single one of those copies. The most widespread client for Windows is uTorrent, whereas Transmission is more widespread on macOS and Linux.

Because of its distributed nature and the difficulty of taking torrents down, BitTorrent is widely used for piracy, which might be the reason it would be blocked by your workplace or ISP. If this is the case, you will need to install a VPN.

(Optional) VPN Client

There are hundreds if not thousands of different VPN clients and service providers, and a lot of them offer the degree of security you will need. Even if the more safety-conscious users tend to use an open-source VPN implementation tunneling to a server they rent themselves, commercial clients and server systems are easier to use and are more reliable. Here are a few that are compatible with BitTorrent: NordVPN, VPNArea, PrivateInternetAccess.

If you wish to build your own VPN tunnel to a server you operate, OpenVPN is the most commonly used solution, and several guides on how to install it on a VPS are available, such as the DigitalOcean tutorial and the OpenVPN quickstart guide.

If you are tunneling, servers in Switzerland/Germany have the best reputation for upholding the privacy of their users.

(Optional) BitTorrent configuration

In order to mitigate the possibility of blocking, it is a good idea to enable port randomization and to make sure that Windows Firewall exceptions are added automatically. This is usually done in the Options>Preferences tab. Another thing you might want to do is to encrypt your traffic, which is set in the preferences tab as well. Finally, you can make sure that bad peers are blocked from connecting to your torrent. Since you aren’t sharing copyrighted material, you shouldn’t normally need that last one, but if you ever do enable it, make sure that educational institutions and universities are not blocked, because they might want to replicate your data. In order not to render your Internet connection unusable for anything else, it might be a good idea to set global bandwidth limits for your BitTorrent client.

Converting the data to a BitTorrent and sharing it:

This part can be done fairly easily with the user interface of your BitTorrent client. Choose the “create new torrent” option and choose the files that you would like to share. With respect to trackers, you will need at least one tracker that registers your torrent so that other people can find it. Just like keyservers, trackers replicate their indexes on other trackers, so adding your torrent to a single tracker in public mode is enough. The most common ones are:

  • udp://tracker.openbittorrent.com:80/announce
  • udp://tracker.publicbt.com:80/announce

A tracker designed specifically for sharing academic data is Academic Torrents.
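For context, the SHA1 hash that BitTorrent clients display for a torrent, and that this guide signs, is computed over the bencoded "info" dictionary of the torrent (original, version 1 torrents). The sketch below is a simplified illustration for a single small file held in memory; the function names and the tiny piece length used in the test are assumptions for this example, and real clients may add further keys to the dictionary.

```python
import hashlib


def bencode(value):
    """Encode ints, bytes/str, lists and dicts in BitTorrent's bencoding."""
    if isinstance(value, bool):
        raise TypeError("booleans have no bencoded form")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def single_file_info(name, data, piece_length=262144):
    """Build a minimal single-file info dictionary from in-memory data."""
    pieces = b"".join(
        hashlib.sha1(data[start:start + piece_length]).digest()
        for start in range(0, len(data), piece_length)
    )
    return {"name": name, "length": len(data),
            "piece length": piece_length, "pieces": pieces}


def info_hash(info):
    """SHA1 of the bencoded info dictionary, as a hex string."""
    return hashlib.sha1(bencode(info)).hexdigest()
```

One consequence worth knowing: the hash covers the file name and piece length as well as the content, so two people who create torrents from identical files with different settings will generally end up with different torrent hashes.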

Signing the hashes

If you click on the properties of your torrent, you should see its SHA1 hash. Put it in a .txt file, along with the names of the data files it is associated with, as well as a short description of what is contained in each file. Please use raw ASCII text, such as .txt files saved from Notepad, as opposed to fancier formats; they will be easier to share in the long run.
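A text file like the one described above can also be generated rather than typed by hand. The layout below (a "Torrent SHA1:" line followed by one "name: description" line per file) is only an assumed convention for this sketch, and build_manifest is a hypothetical helper; any plain ASCII layout that a human can read will do.

```python
import re


def build_manifest(torrent_sha1, files):
    """Return the manifest text for a torrent.

    `files` is a list of (file name, short description) pairs.
    """
    if not re.fullmatch(r"[0-9a-fA-F]{40}", torrent_sha1):
        raise ValueError("a SHA1 hash is 40 hexadecimal characters")
    lines = [f"Torrent SHA1: {torrent_sha1.lower()}", ""]
    for name, description in files:
        lines.append(f"{name}: {description}")
    text = "\n".join(lines) + "\n"
    # Enforce raw ASCII; this raises UnicodeEncodeError otherwise.
    text.encode("ascii")
    return text
```

The returned text can be saved as the .txt file that is signed in the next step.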

After you are done with that, clearsign it:

> gpg --clearsign {your text file}.txt

The result will be saved as

> {your text file}.txt.asc

Publishing the signed hashes

Now you need to publish the signed hashes in as many places as possible, preferably on different platforms, and ensure they are copied and saved by other people as well. Since they are published as ASCII cleartext, they can be copied directly from web pages and verified by anyone who needs to.
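To show why copying from a web page works, here is a sketch of how a clearsigned file is laid out under the OpenPGP cleartext signature framework: a header line, armor headers such as "Hash:", an empty line, the readable text, then the signature block. The function below only splits these parts for display and checks nothing; verification still has to be done with gpg. The function name is an assumption for this example.

```python
BEGIN_MESSAGE = "-----BEGIN PGP SIGNED MESSAGE-----"
BEGIN_SIGNATURE = "-----BEGIN PGP SIGNATURE-----"
END_SIGNATURE = "-----END PGP SIGNATURE-----"


def split_clearsigned(text):
    """Split clearsigned text into (readable body, signature block).

    This does NOT verify the signature; it only shows the structure.
    """
    lines = text.replace("\r\n", "\n").split("\n")
    try:
        start = lines.index(BEGIN_MESSAGE)
        sig_start = lines.index(BEGIN_SIGNATURE, start)
        sig_end = lines.index(END_SIGNATURE, sig_start)
        # Armor headers run up to the first empty line after the header.
        body_start = lines.index("", start, sig_start) + 1
    except ValueError:
        raise ValueError("not a complete clearsigned message") from None
    # Lines starting with "-" are stored dash-escaped as "- " + line.
    body = [line[2:] if line.startswith("- ") else line
            for line in lines[body_start:sig_start]]
    return "\n".join(body), "\n".join(lines[sig_start:sig_end + 1])
```

Everything between the empty line and the signature block is the text that was signed, which is why the whole block must be copied intact, markers included.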

Publishing links to the torrents and the signed hashes

Now, for peers to be able to replicate your data, you actually need to reach out to them and send them both the signed hashes of your files and links to the torrents. A lot of well-known scientists and institutions have offered their help in backing up data. It might be a good time to send them the link and the signed hashes, and to do the same for friends and peers in other countries who you know would want to see the data you are trying to back up kept safe.

Finally, it might also be a good idea to post the link on your Twitter account, or on the Twitter account of your anonymous identity, especially if it is followed and trusted by large associations dedicated to data protection.

On my side, I will inspect everything published on the ##sciencebyfacts Freenode IRC channel and publish it on the sciencebyfacts.org website, as well as forward it to other groups for safekeeping.

(Aside) IRC channels

IRC is a communication protocol allowing users to meet in chatrooms, operating without a central server and quite robust to attacks, even by state-sponsored actors. Even if its popularity has significantly declined in recent years, it’s still a solid tool for anonymous topic-specific communication. The easiest way to join an IRC channel from any platform is by installing the CIRC extension in the Google Chrome browser. Even if its user interface is not as extensive as that of other clients, it’s intuitive enough to use and gets the job done.

In general, it’s a good idea to package your torrent link and the cleartext signed hashes in a [pastebin](http://pastebin.com/) and paste the link to the channel.

(Optional) Time-stamping with the Bitcoin blockchain

If you would like to guarantee that a document was in a certain state at a given time, injecting the SHA1 fingerprint of your torrent into the Bitcoin blockchain is as close as it gets to a notarial timestamp. Because of the nature of the blockchain, the time cannot be falsified, and as long as Bitcoin is alive, your timestamp cannot be erased or tampered with.

If you are already using Bitcoin, you can embed the SHA1 of your torrent in a transaction yourself, typically in an OP_RETURN output.

Otherwise, you can use existing timestamp providers, such as https://www.originstamp.org/ (free) or https://stampd.io (0.10$ per stamp).
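For readers taking the do-it-yourself route, data is commonly embedded in a Bitcoin transaction through an output whose script starts with the OP_RETURN opcode, followed by a push of the data. The sketch below only builds the bytes of such a script for a hash given in hexadecimal; creating, funding and broadcasting the transaction is left to your wallet software, and the function name is an assumption for this example.

```python
OP_RETURN = 0x6A  # opcode that marks an output as carrying data only


def op_return_script(hex_digest):
    """Return the script bytes: OP_RETURN, a one-byte push length, the data."""
    data = bytes.fromhex(hex_digest)
    # Pushes of 1 to 75 bytes use the length itself as the push opcode.
    if not 1 <= len(data) <= 75:
        raise ValueError("this sketch only handles 1 to 75 bytes of data")
    return bytes([OP_RETURN, len(data)]) + data
```

A 20-byte SHA1 hash therefore yields a 22-byte script starting with the bytes 6a 14.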

Done

At this point the data you have published can be picked up by your peers. Everyone who is convinced of the importance of the data, even without access to any major infrastructure, will be able to re-sign the hashes of your dataset, download and re-seed the data, effectively keeping it alive until it is needed and ensuring good data integrity.

Now, you might want to join the community and participate in the safekeeping of the data from your peers.

Participating in the data safe-keeping:

First, you will need to install the stack described above: GPG with a valid and published key pair, and a BitTorrent client. If you find a dataset that feels important and trustworthy, go ahead, download it through BitTorrent, check that its SHA1 corresponds to the signed version and become a seeder for it.
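The comparison between the hash your BitTorrent client reports and the signed text is easy to get wrong by eye, so here is a small sketch of it. It assumes the signed text contains the torrent hash as a 40-character hexadecimal string somewhere in it; the function names are hypothetical. A match is only meaningful once the signature on the text itself has been verified with gpg.

```python
import re


def hashes_in_text(text):
    """Return every standalone 40-hex-character string in `text`, lower-cased."""
    return {match.lower() for match in re.findall(r"\b[0-9a-fA-F]{40}\b", text)}


def hash_is_listed(client_hash, signed_text):
    """True if the hash shown by the client appears in the signed text."""
    return client_hash.strip().lower() in hashes_in_text(signed_text)
```

Longer hexadecimal strings, such as 64-character SHA-256 values, are deliberately not matched, so that a SHA1 is never confused with part of another hash.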

Obtaining public keys

If you have an email address associated with a key, or the ID of a key, you can look it up on the keyservers. Since they pass keys between them, all the keys committed to one keyserver will end up on all the others after a time, so it doesn’t really matter which one you use. pgp.mit.edu is usually a pretty good one.

> gpg --keyserver pgp.mit.edu --recv-keys {ID of your contact's key}

Signing others’ public keys

If you have met in person the person who initially signed the dataset, and they gave you the fingerprint of their key in person, for instance on a business card, you can go ahead and sign their key, then re-upload it to the keyserver:

> gpg --edit-key {your personal contact's key ID here / e-mail}
> fpr

Please check that the fingerprint matches the one you were given and that its last 8 characters correspond to those in the ID

> sign
> save
> gpg --keyserver pgp.mit.edu --send-keys {your personal contact's key ID here}

This allows the ownership of the key and the signature to be independently verified through a “web of trust”, even by people who have no means of verifying the key directly.
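The check that a key ID corresponds to the end of a fingerprint can be sketched as follows. It assumes the 40-hexadecimal-character fingerprints that gpg prints for this kind of key, possibly grouped with spaces; the function names are hypothetical. Note that an 8-character ID is short enough for collisions to be manufactured, so the full fingerprint received in person is what should really be compared.

```python
def normalize_hex(value):
    """Drop spaces and colons, upper-case, and strip a leading 0x."""
    cleaned = "".join(ch for ch in value if ch not in " :").upper()
    return cleaned[2:] if cleaned.startswith("0X") else cleaned


def key_id_matches(fingerprint, key_id):
    """True if `key_id` (8 or 16 hex characters) is the tail of `fingerprint`."""
    fpr = normalize_hex(fingerprint)
    kid = normalize_hex(key_id)
    if len(fpr) != 40:
        raise ValueError("expected a 40-character fingerprint")
    if len(kid) not in (8, 16):
        raise ValueError("expected an 8- or 16-character key ID")
    return fpr.endswith(kid)


def fingerprints_equal(first, second):
    """Compare two full fingerprints, ignoring spacing and letter case."""
    return normalize_hex(first) == normalize_hex(second)
```

For example, a fingerprint read off a business card with spaces between groups compares equal to the same fingerprint printed by gpg without them.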

Verifying a clearsigned file

You can verify that the hashes were signed by the correct person by typing:

> gpg --verify {your text file}.txt.asc

If that’s your own file, you will be able to verify it right away. Otherwise, you will need to get the public key of the person who has signed it.

Re-signing hashes with your key

If you have personally validated the integrity of the datasets and confirmed that they correspond to your version, you can go ahead and re-sign the hashes as if they were for your own backup of the files, and publish the signed hashes as you would your own.

Publish the hashes signed by you, as well as those signed by the original author, as you would for your own datasets, and share them, along with the original torrent link, on the channels on which you are trusted.

That’s it.

The data has been publicly backed up and verified and should be safe as long as there are enough seeders for it.

Bonus:

Mail encryption:

The Enigmail plug-in for Thunderbird will allow you to use your newly created keys, as well as the public keys you’ve downloaded, to exchange encrypted messages with their owners.

Secure messaging:

Signal is a good secure messaging application, that has an option for deleting messages after a certain time.

Links for more in-depth information:

  • GnuPG guide – creation of a key pair and publishing of a key: https://www.gnupg.org/gph/en/manual.html#INTRO
  • GnuPG guide – signing and verifying the signatures: https://steemit.com/pgp/@dhumphrey/how-to-clearsign-and-verify-a-message-using-pgp-gpg https://www.gnupg.org/gph/en/manual/x135.html
  • Adding additional IDs to a key: https://www.katescomment.com/how-to-add-additional-email-addresses-to-your-gpg-identity/
  • BitTorrent protocol: http://www.howtogeek.com/141257/htg-explains-how-does-bittorrent-work/
  • Using BitTorrent: http://www.howtogeek.com/howto/31846/bittorrent-for-beginners-how-get-started-downloading-torrents/
  • Adding file to BitTorrent: http://lifehacker.com/5534190/how-to-share-your-own-files-using-bittorrent
  • Importing other’s public key and signing it: https://www.gnupg.org/gph/en/manual/x56.html
  • On the importance of the Web of Trust: http://web.archive.org/web/20160220165445/http://elmahdielmhamdi.com/2014/12/20/public-key-encryption-and-the-web-of-trust/