Sebastian Klipp - Data Scientist and Machine Learning Engineer | Collection of Machine Learning techniques, projects and code snippets

Hi there, my name is Sebastian Klipp - I love realising Machine Learning related projects.

Besides my amazing (part-time) job at Babbel, I dedicate the rest of my professional time to personal and freelance projects. If you want to work together, feel free to reach out to me!

After graduating with a Master’s degree in physics, with a focus on computational modelling and complex system simulations, I worked in the overlapping fields of Data Science, Data Engineering and Machine Learning Engineering. If you want to know more, take a peek at my LinkedIn profile and check out my GitHub.

Here is a short list of the topics I am most interested in:

From idea to deployment: Full data pipeline implementation

I enjoy turning ideas into real-life systems, covering all aspects of modelling and data pipeline implementation:

Problem and architecture design
ETL pipeline implementation
Feature engineering
Model selection, training, optimisation and evaluation
Deployment and automation

Business Analytics and Data Products

I find it interesting to model business processes and use data to optimise them, and to make data a central part of a product in order to improve customer satisfaction. Typical projects include user-directed recommendations and sales predictions. I also have experience in contributing to data-centred business models.

Probabilistic models and model explainability

The most successful Machine Learning methods (neural networks, XGBoost) do not provide two things out of the box that are especially important in Business Analytics and Decision Making: a measure of prediction confidence and transparency about the decision process.
To obtain probabilistic predictions, most Bayesian methods can be used, and specialised models such as quantile regression averaging or NGBoost also exist. If these do not help, building large ensembles of models trained on bootstrapped input data with different random seeds can give probabilistic estimates of statistical properties.
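As a minimal sketch of the bootstrapped-ensemble idea, the snippet below fits one simple model per bootstrap resample and reads an interval off the spread of the predictions. The synthetic data, the function name bootstrap_ensemble_predict and the choice of a linear fit via NumPy are assumptions made for illustration only; a linear fit is deterministic, so only the resampling varies here, whereas a stochastic model would also get a different random seed per ensemble member.

```python
import numpy as np


def bootstrap_ensemble_predict(x, y, x_new, n_models=200, degree=1, seed=0):
    """Fit one polynomial model per bootstrap resample and return all predictions.

    The result has shape (n_models, len(x_new)).
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    preds = np.empty((n_models, len(x_new)))
    for i in range(n_models):
        idx = rng.integers(0, n, size=n)  # draw rows with replacement
        coeffs = np.polyfit(x[idx], y[idx], deg=degree)
        preds[i] = np.polyval(coeffs, x_new)
    return preds


# Synthetic data (illustrative only): y = 2x + 1 plus Gaussian noise
data_rng = np.random.default_rng(42)
x = data_rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + data_rng.normal(0, 1.0, size=100)

x_new = np.array([2.0, 5.0, 8.0])
preds = bootstrap_ensemble_predict(x, y, x_new)

mean = preds.mean(axis=0)
# 95% percentile interval across the ensemble members
lower, upper = np.percentile(preds, [2.5, 97.5], axis=0)
```

Note that the interval obtained this way reflects the uncertainty of the fitted model, not the noise of individual observations, so it is narrower than a full predictive interval.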
In the field of model explainability, different techniques for obtaining feature importances exist, and newer approaches such as SHAP values help to understand a model’s decisions and to build trust.
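One of the simpler feature importance techniques is permutation importance, sketched below with scikit-learn. This is not SHAP; it is swapped in here because it needs no extra package. The synthetic data set, in which the first feature dominates the target by construction, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): feature 0 matters most, feature 2 not at all
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the score drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)
importances = result.importances_mean
ranking = np.argsort(importances)[::-1]  # most important feature first
```

Computing the importances on held-out data rather than on the training set avoids rewarding features the model has merely memorised.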

Advanced automatic feature engineering

I am interested in every technique that squeezes more information out of raw data while avoiding manual engineering where possible:

Dimensionality reduction techniques: PCA, LDA, t-SNE, UMAP
Neural network embeddings: autoencoder embeddings, TabNet embeddings
Automatic rule-based feature engineering Python packages: featuretools, tsfresh
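To make the first item of the list above concrete, here is a minimal sketch of PCA via a singular value decomposition in plain NumPy. The helper name pca_transform and the synthetic data, five observed features generated from two latent factors, are assumptions for illustration; in practice a library implementation such as scikit-learn's PCA would usually be the more convenient choice.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project X onto its first principal components.

    Returns the projected data and the explained variance ratio per component.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance_ratio = (S**2 / np.sum(S**2))[:n_components]
    return X_centered @ components.T, explained_variance_ratio


# Synthetic data (illustrative only): 5 features driven by 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

Z, ratio = pca_transform(X, n_components=2)
```

Because the data were built from two latent factors, the two retained components should capture almost all of the variance, which ratio.sum() makes easy to check.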

Tree models

There is a wide variety of tree models, of which the most prominent are arguably XGBoost and Random Forest. But there are many more variants, among them CatBoost, LightGBM, NGBoost, extra trees, oblivious trees, isolation forests, LambdaMART and BART.
While the established models are probably the first choice for standard supervised problems, there is a variety of niche applications where tree models can be used, e.g. probability distribution prediction, outlier detection or ranking.
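As an example of one of these niche applications, the sketch below uses scikit-learn's IsolationForest for outlier detection. The synthetic data, a Gaussian cloud plus three hand-placed far-away points, and the contamination value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (illustrative only): a Gaussian cloud plus three far-away points
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
X = np.vstack([inliers, outliers])

# contamination is the assumed share of outliers in the data
forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
forest.fit(X)

labels = forest.predict(X)        # +1 for inliers, -1 for outliers
scores = forest.score_samples(X)  # lower score means more anomalous
```

Points that can be separated from the rest with only a few random splits receive low scores, which is why the three distant points stand out.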