PanProteome Aligner is a tool I wrote that builds graphs from amino acid sequences, aligns both DNA and amino acid sequences back to them, and outputs the alignments in the GAF format. The idea is to leverage the fact that amino acid sequences are more conserved than nucleotide sequences: by building an individual graph for each gene from its amino acid sequences, we can align sequences from organisms that are more distant on the phylogenetic tree. The tool can be found Here, and some documentation can be found on ReadTheDocs. The preprint, which has been accepted to Bioinformatics Advances, can be found Here.
This was done during my master's thesis with Tobias Marschall. I made a tool that detects bubble and superbubble chains in genome graphs. It is written in Python 3 and is relatively fast: it finished in around 20 to 25 minutes on a human-genome-size graph (around 25 million nodes) with around 25 GB of memory. The tool can be found Here. This work has been published in Bioinformatics and can be found Here.
I recently wrote this small tool in Python 3 that takes a multiple sequence alignment (MSA) and turns it into a graph file in GFA v1 format. It also adds paths to the file that correspond to the original sequences in the MSA, so that if you visualize the graph in Gfaviz, for example, the paths will be colored. The code is described Here.
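To make the MSA-to-graph idea concrete, here is a minimal sketch. It is not the tool's actual algorithm: it assumes the simplest possible construction, with one node per distinct residue per alignment column, gaps skipped, and no compaction of linear stretches. The function name `msa_to_gfa` and the toy alignment are made up for illustration.

```python
from typing import Dict, List, Tuple


def msa_to_gfa(msa: Dict[str, str]) -> str:
    """Turn an MSA (name -> gapped sequence) into GFA v1 text.

    Assumption: one node per (column, residue) pair, no node compaction.
    """
    lengths = {len(seq) for seq in msa.values()}
    if len(lengths) != 1:
        raise ValueError("all MSA rows must have the same length")

    node_ids: Dict[Tuple[int, str], int] = {}  # (column, residue) -> node id
    segments: List[str] = []
    links = set()
    paths: List[str] = []

    for name, seq in msa.items():
        walk: List[int] = []
        for col, residue in enumerate(seq):
            if residue == "-":  # gaps do not create nodes
                continue
            key = (col, residue)
            if key not in node_ids:
                node_ids[key] = len(node_ids) + 1
                segments.append(f"S\t{node_ids[key]}\t{residue}")
            walk.append(node_ids[key])
        # consecutive nodes of a sequence are joined by an edge
        links.update(zip(walk, walk[1:]))
        steps = ",".join(f"{n}+" for n in walk)
        paths.append(f"P\t{name}\t{steps}\t*")

    link_lines = [f"L\t{a}\t+\t{b}\t+\t0M" for a, b in sorted(links)]
    return "\n".join(["H\tVN:Z:1.0", *segments, *link_lines, *paths]) + "\n"


# Hypothetical toy alignment
example = {"seq1": "MK-LV", "seq2": "MKALV", "seq3": "MR-LV"}
gfa_text = msa_to_gfa(example)
print(gfa_text)
```

Each `P` line records the walk of one original sequence through the graph, which is what lets a viewer color the sequences individually.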
I extended C++ code written by Guillaume Holley and added a Python script that makes an UpSet plot from its output. The idea is that if you give, for example, 5 genomes to Bifrost and build a pangenome with a colors file, the C++ code can output the different intersections of k-mers between these genomes; my UpSet plot script can then take this output and plot it. The code is described Here.
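As a rough illustration of what "intersections of k-mers" means here, the sketch below counts, for each exact combination of genomes, how many k-mers occur in those genomes and no others. These counts are the bar heights of an UpSet plot. It does not use Bifrost or reproduce the actual C++ code; the genome names, k-mers, and the function `exclusive_intersections` are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def exclusive_intersections(kmers_by_genome: Dict[str, Set[str]]) -> Counter:
    """Count k-mers per exact combination of genomes that contain them."""
    membership: Dict[str, Set[str]] = {}  # k-mer -> genomes it occurs in
    for genome, kmers in kmers_by_genome.items():
        for kmer in kmers:
            membership.setdefault(kmer, set()).add(genome)
    return Counter(frozenset(genomes) for genomes in membership.values())


# Hypothetical toy input: three genomes with a handful of 3-mers each
toy = {
    "genomeA": {"ACG", "CGT", "GTA"},
    "genomeB": {"ACG", "CGT", "TTT"},
    "genomeC": {"ACG", "GGG"},
}
counts = exclusive_intersections(toy)
for combo, n in sorted(counts.items(), key=lambda item: (-item[1], sorted(item[0]))):
    print("&".join(sorted(combo)), n)
```

Keying on the exact set of genomes means every k-mer is counted once, so the bars of the plot sum to the total number of distinct k-mers.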
During my work at the Max Planck Institute for Informatics, I worked on a pipeline that tries to automate the checking and fixing of metadata for samples from the DEEP project (The German Epigenetics Project). Because many labs were involved, a problem with metadata consistency arose. The metadata needs to be consistent, especially when building statistical models from the samples, because different spellings of the same feature value can be treated as genuinely different values. For example, if the same sequencer was used on several samples but its name was written slightly differently, a statistical model in R might consider those to be different sequencers. The pipeline was presented as a poster (number 40) at the IHEC meeting in Berlin in 2017, Here. The pipeline, written in Python, can be found Here.
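The kind of inconsistency described above can be sketched in a few lines. This is only an illustration under simple assumptions, not the pipeline itself: values are considered the same if they match after lowercasing and removing everything except letters and digits. The helper names and the sample column are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def canonical_key(value: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def group_variants(values: Iterable[str]) -> Dict[str, List[str]]:
    """Return only the canonical keys that were written in more than one way."""
    groups: Dict[str, set] = defaultdict(set)
    for value in values:
        groups[canonical_key(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hypothetical metadata column with the same sequencer spelled three ways
sequencer_column = ["HiSeq 2000", "hiseq2000", "HiSeq-2000", "HiSeq 2500", "HiSeq 2000"]
inconsistent = group_variants(sequencer_column)
print(inconsistent)
```

Groups reported this way can then be reviewed and mapped to a single agreed spelling before the metadata is used in a model.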
This Shiny app in R was written to retrieve, visualize, and run batch effect analysis on epigenomic data from DeepBlue servers. It was designed to automate the batch effect analysis and make visualization easier. The implementation can be found Here.