Pierre Mallinjoud

Tech expert, bioinformatics engineer, web and database developer

I'm a web and database developer with a strong passion for science. This interest led me to study genomics and statistics in order to work in biological research. These days, I'm also exploring AI and blockchain through various personal projects.

Experience at EnyoPharma

At EnyoPharma, I worked as a bioinformatics engineer in the R&D team, collaborating closely with biologists to study human-virus protein-protein interactions. The biologists curated scientific publications to manually extract protein-protein interaction data.

To support this effort, I developed two full-stack applications:

  • Drakkar: A database and web interface designed to assist biologists in curating protein-protein interactions. It provides rich web forms to populate a large PostgreSQL database with curated data.
  • Vinland: A read-only, smaller version of the Drakkar database, offering a public web interface for querying protein-protein interactions and visualizing protein interaction networks.

Both applications share the same technology stack:

  • PostgreSQL as the database
  • A custom PHP backend
  • A React frontend
  • Containerized deployment with Docker

Additionally, I developed Perl scripts to transform the Drakkar database into Vinland. This work and the curated data contributed to a published scientific article.

Meyniel-Schicklin L, Amaudrut J, Mallinjoud P, et al. Viruses traverse the human proteome through peptide interfaces that can be biomimetically leveraged for drug discovery. Proc Natl Acad Sci U S A. 2024;121(5):e2308776121. doi:10.1073/pnas.2308776121

typescript javascript php perl python sql reactjs docker database postgresql uniprot

Experience at CRCL

At the CRCL (Cancer Research Center of Lyon), I built a database cataloging alternative splicing events in the human and mouse genomes. To do so, I started from messenger RNA sequences available in GenBank and aligned them to the human and mouse reference genomes.

I then mapped Affymetrix exon array probes onto this annotation. This allowed me to create another database compiling differential splicing expression data from numerous experiments, both public and internal to the lab.
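
As a rough illustration of the probe-mapping step, here is a minimal sketch, assuming probes and exons are both described by genomic intervals. The identifiers, coordinates, and the containment rule are hypothetical; the original pipeline may have used different criteria.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end), 1-based inclusive coordinates (an assumption).
Interval = Tuple[str, int, int]


def map_probes_to_exons(
    probes: Dict[str, Interval], exons: Dict[str, Interval]
) -> Dict[str, List[str]]:
    """Assign each probe to every exon that fully contains it.

    Probes that straddle an exon boundary or fall on another
    chromosome end up with an empty list.
    """
    mapping: Dict[str, List[str]] = {}
    for probe_id, (p_chr, p_start, p_end) in probes.items():
        mapping[probe_id] = [
            exon_id
            for exon_id, (e_chr, e_start, e_end) in exons.items()
            if e_chr == p_chr and e_start <= p_start and p_end <= e_end
        ]
    return mapping
```

This brute-force comparison is only practical for small inputs; at genome scale an interval index or a sorted sweep would be the usual choice.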

This work contributed to a scientific publication and is available online at FASTERDB.

Mallinjoud P, Villemin JP, Mortada H, et al. Endothelial, epithelial, and fibroblast cells exhibit specific splicing programs independently of their tissue of origin. Genome Res. 2014;24(3):511-521. doi:10.1101/gr.162933.113

perl R sql mysql genbank microarrays affymetrix blast

Building a clean peptide-receptor dataset from the PDB

As part of my current work on a future version of Mímir, I am exploring the PDB to build a clean dataset of peptide-receptor pairs for machine learning.

The project starts from public RCSB PDB entries and applies conservative structural filters to keep only pairs that are easy to justify and audit. Instead of trying to repair ambiguous cases, I reject them and keep the selection logic explicit. The goal is not to maximize the number of samples, but to build a reliable dataset for machine learning.
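
The reject-rather-than-repair idea can be sketched as follows. This is an illustration under stated assumptions: the `ChainPair` fields, the thresholds, and the rejection labels are all hypothetical and are not the project's actual criteria.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ChainPair:
    """Hypothetical summary of one candidate peptide-receptor pair."""
    pdb_id: str
    peptide_length: int
    receptor_length: int
    has_nonstandard_residues: bool
    min_occupancy: float


# Illustrative thresholds only; the real dataset may use different ones.
MIN_PEPTIDE_LEN = 4
MAX_PEPTIDE_LEN = 30
MIN_RECEPTOR_LEN = 50


def rejection_reason(pair: ChainPair) -> Optional[str]:
    """Return why a pair is rejected, or None if it passes every filter."""
    if not MIN_PEPTIDE_LEN <= pair.peptide_length <= MAX_PEPTIDE_LEN:
        return "peptide length out of range"
    if pair.receptor_length < MIN_RECEPTOR_LEN:
        return "receptor too short"
    if pair.has_nonstandard_residues:
        return "non-standard residues"
    if pair.min_occupancy < 1.0:
        return "partial occupancy"
    return None


def select(pairs: List[ChainPair]) -> Tuple[List[ChainPair], Dict[str, str]]:
    """Split candidates into kept pairs and an audit log of rejections."""
    kept: List[ChainPair] = []
    rejected: Dict[str, str] = {}
    for pair in pairs:
        reason = rejection_reason(pair)
        if reason is None:
            kept.append(pair)
        else:
            rejected[pair.pdb_id] = reason
    return kept, rejected
```

Returning a reason for every rejected entry, instead of silently dropping it, is what keeps the selection auditable.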

The final dataset is stored in LMDB and keeps structural information such as sequences, residue identities, atom coordinates, B-factors, and occupancy values. I also built a Next.js viewer on top of it to inspect selected chain pairs and make the dataset easier to explore.

python bioinformatics data engineering pdb lmdb nextjs

Fine-tuning ESM3 for generative biology

While exploring generative AI for biology, I set out to build Mímir, a model meant to generate new peptide binders from the 3D structure of a target protein. It is a fine-tuned version of ESM3, a 1.4B-parameter protein language model. I built the full training pipeline myself, from data preparation to training and evaluation on cloud GPUs.

Rather than relying on high-level wrappers or basic tutorials, I wrote the fine-tuning pipeline from scratch and ran it end to end on an H100 GPU on Lightning AI. Fitting a model of this size into memory was a significant engineering challenge and required techniques such as 8-bit AdamW, gradient checkpointing, Flash Attention, and dynamic bucket batching.
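
To illustrate one of these techniques, here is a minimal sketch of dynamic bucket batching, assuming a fixed token budget per batch. The function name and the `max_tokens` parameter are hypothetical; this is not the code from the Mímir repository.

```python
from typing import List, Sequence


def bucket_batches(lengths: Sequence[int], max_tokens: int) -> List[List[int]]:
    """Group sample indices into batches whose padded size fits a budget.

    Padded size = (longest sequence in the batch) * (number of sequences).
    A sequence longer than max_tokens still gets a batch of its own.
    """
    # Sorting by length puts similar lengths together and limits padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Ascending order: the newest sample is the longest in the batch.
        padded = lengths[idx] * (len(current) + 1)
        if current and padded > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches
```

With this scheme, short sequences are packed into large batches and long sequences into small ones, so the memory footprint per step stays roughly constant.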

Beyond the engineering, this project gave me a deep understanding of the inner workings of ESM3. I learned how to handle its multi-track input design (sequence, 3D coordinates, and solvent accessibility), how geometric attention processes spatial relationships, and how to design custom loss functions for masked language modeling. Even though the model didn’t fully achieve generalized transfer learning due to the complexities of multi-domain structural representations, going through the entire process, from data pipeline to cloud execution and post-mortem analysis, gave me invaluable hands-on experience with large-scale model training.

The project, including its design document and a detailed post-mortem, is available on its GitHub repository.

python pytorch esm3 deep learning fine tuning llm cloud

Fine-tuning a binary classification model

I used a dataset from my past work at EnyoPharma to learn how to fine-tune a model. It contains about 80,000 manually curated abstracts of scientific publications, each labeled by whether it describes protein-protein interactions.

Using the Hugging Face Transformers library, I fine-tuned a pretrained model into a binary classifier. The goal was not to build a production model, but to go through the full fine-tuning workflow on a real-world dataset and understand where the main improvement levers are.

The study is documented in its GitHub repository and Jupyter notebooks.

python jupyter deep learning fine tuning huggingface transformers data science

Command-line RAG pipeline for websites

I built RAG-URL to explore the architecture of an agentic RAG system end to end.

It covers the full workflow from web scraping and content cleaning to semantic chunking, embeddings, vector search, and LLM-powered interaction.

The system runs in four stages:

  • Scrape: Crawls and extracts content into cleaned Markdown files
  • Chunk: Uses Gemini to split content into semantically meaningful sections
  • Embed: Embeds each chunk with Gemini and stores vectors in LanceDB
  • Agent: A CLI chatbot built with PydanticAI, querying the database with Gemini

The pipeline is built around Gemini models and is not yet model-agnostic or extensible.
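
The retrieval step behind the Agent stage can be illustrated with a small, library-free sketch. It does not call Gemini or LanceDB: toy vectors stand in for the embeddings, and a brute-force cosine similarity stands in for the database's vector search. All names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(cosine(query, vector), text) for text, vector in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In a pipeline like this one, the texts returned by `top_k` would be passed to the LLM as context for answering the user's question.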

python ai agent rag llm prompting gemini pydantic-ai lancedb

Conversation-driven multi-agent framework

I created MC Architecture (Master of Ceremony) as an experiment in multi-agent conversations.

Instead of relying on a central controller to direct every exchange, MC Architecture lets agents take turns based on shared context. The interaction model is closer to a group chat where every participant sees the same history, but only one speaks at a time.

The framework is both agent-agnostic and model-agnostic, and acts as a lightweight wrapper around existing AI libraries. I integrated it with PydanticAI, including participant selection logic to keep the dialogue coherent.
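
The group-chat interaction model can be sketched in a few lines. This is not the MC Architecture API: the `Agent` type, the `pick_next` rule (prefer a participant named in the last message, otherwise rotate), and the `run` loop are hypothetical stand-ins for the real participant selection logic.

```python
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (speaker name, text)
Agent = Callable[[List[Message]], str]  # reads shared history, returns a reply


def pick_next(names: List[str], history: List[Message]) -> str:
    """Choose the next speaker from the shared history (illustrative rule)."""
    if not history:
        return names[0]
    last_speaker, last_text = history[-1]
    # Prefer a participant who was addressed by name in the last message.
    for name in names:
        if name != last_speaker and name in last_text:
            return name
    # Otherwise rotate to whoever follows the last speaker.
    return names[(names.index(last_speaker) + 1) % len(names)]


def run(agents: Dict[str, Agent], turns: int) -> List[Message]:
    """Let agents take turns; every agent sees the same full history."""
    history: List[Message] = []
    names = list(agents)
    for _ in range(turns):
        speaker = pick_next(names, history)
        history.append((speaker, agents[speaker](history)))
    return history
```

Because each agent is just a callable over the shared history, any model or library can sit behind it, which is the sense in which such a wrapper stays agent-agnostic and model-agnostic.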

I originally created it for creative storytelling and simulation, but it also looked promising for collaborative problem-solving where context awareness and conversational dynamics matter.

python ai agent llm prompting openai anthropic gemini pydantic-ai