Dominic Kempf committed
A small C++ tool to calculate pairwise distances between gene sequences given in fasta format.

# Python interface

To use the Python interface, you should install it from PyPI:

```
python -m pip install hammingdist
```

Then you can use it from Python, for example like this:

```
import hammingdist

# This accepts exactly the same two arguments as the stand-alone
# executable: a fasta file and the maximum number of sequences to consider
data = hammingdist.from_fasta("example.fasta", 100)

# The distance data can be accessed point-wise, though looping over all distances might be quite inefficient
print(data[14,42])

# The data can be written to disk and retrieved:
data.dump("backup.csv")
retrieval = hammingdist.from_csv("backup.csv")

# Finally, we can pass the data as a list of strings in Python:
data = hammingdist.from_stringlist(["ACGTACGT", "ACGTAGGT", "ATTTACGT"])

# When in doubt, the internal data structures of the DataSet object can be inspected:
print(data._data)
print(data._distances)
```
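
As a point of reference, the distance between two equally long sequences can be sketched in pure Python as the number of positions at which they differ. This is only an illustration of the plain Hamming distance, not the package's implementation: the helper names `hamming` and `pairwise` are made up for this sketch, and `hammingdist` itself may treat special characters (such as gaps) differently.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise(sequences: List[str]) -> List[List[int]]:
    """Return the full symmetric matrix of pairwise distances."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        # Only compute the upper triangle and mirror it.
        for j in range(i + 1, n):
            d = hamming(sequences[i], sequences[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    for row in pairwise(["ACGTACGT", "ACGTAGGT", "ATTTACGT"]):
        print(row)
```

The nested Python loops scale quadratically with the number of sequences and are slow for large inputs, which is the reason to use the compiled tool for real datasets.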

# Prerequisites

The following software is currently required if you build from scratch:

* git
* CMake >= 3.11
* A reasonably new C++ compiler
* Python 3

# Building

This sequence of commands lets you start from scratch:

```
git clone https://gitlab.dune-project.org/dominic/covid-tda.git
cd covid-tda
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make distance
```

This should build the executable `distance` in the `build` subdirectory.
If `cmake` picks up the wrong compiler, it can be explicitly enforced by adding
`-DCMAKE_CXX_COMPILER=<path-to-compiler>` to the cmake call (in that case it is best
to remove the build directory and start from scratch).

# Running

The tool can be run from the command line:

```
./distance <path-to-input> <n>
```

Here, `<path-to-input>` must point to a fasta dataset, e.g. `../data/example.fasta`
after placing a data file in the `data` directory. `<n>` is the maximum number of gene
sequences that the tool should read. It may be smaller than the number of
gene sequences in the dataset.

The output is currently written to a file `distances.csv`. The output is a full
matrix, not only the triangular part of the symmetric matrix.
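
Because the full matrix is stored, the file can be loaded row by row and checked for symmetry. The following sketch assumes comma-separated integer values without a header row, which should be verified against an actual `distances.csv`; the helper names `read_matrix` and `is_symmetric` are made up for illustration, and the sample data is synthetic.

```python
import csv
import io
from typing import List, TextIO


def read_matrix(handle: TextIO) -> List[List[int]]:
    """Read a full square matrix of integers from CSV text."""
    rows = [[int(cell) for cell in row] for row in csv.reader(handle) if row]
    if any(len(row) != len(rows) for row in rows):
        raise ValueError("expected a square matrix")
    return rows


def is_symmetric(matrix: List[List[int]]) -> bool:
    """Check that the lower triangle mirrors the upper triangle."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(i))


if __name__ == "__main__":
    # Synthetic stand-in for the contents of a distances.csv file.
    sample = io.StringIO("0,1,2\n1,0,3\n2,3,0\n")
    matrix = read_matrix(sample)
    print(matrix, is_symmetric(matrix))
```

To read a real file, pass `open("distances.csv", newline="")` instead of the `io.StringIO` object.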

# Building the Python interface

This sequence of commands should build the Python interface:

```
git clone --recursive https://gitlab.dune-project.org/dominic/covid-tda.git
cd covid-tda
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make hammingdist
```

# Deploying the Python interface

Deployment requires `docker` to be installed and the current user to have permission
to run it. The deployment process is then automated by a script:

```
./bin/deploy.sh
```