Authors: Susanne Schaller, Johannes Weinberger, Martin Danzer, Christian Gabriel, Rainer Oberbauer, Stephan M. Winkler
We propose an empirical approach, based on the analysis of next-generation sequencing (NGS) data, for describing the number of distinct clones of B- and T-cell receptors in the human immune system. The status of a human immune system is defined, among other features, by the diversity of these receptors. It is well known that NGS data have a comparatively high error rate, so the number of distinct sequences found in sequencing data keeps rising with the number of sequences measured by second-generation sequencers. We present a modeling approach that formulates the number of distinct clones as a function of the number of reads and accounts for two effects: on the one hand, there is a true number of distinct sequences, which is approached asymptotically as the number of reads increases; on the other hand, the number of randomly found sequences rises linearly because of read errors. The parameters of this combined model are identified by parameter optimization with evolution strategies. The modeling approach is evaluated on immune status data of several human patients. Additionally, the results are compared to those produced by machine learning methods.
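To make the combined model more concrete, the following is a minimal sketch, not the implementation used in this work. The abstract does not state the functional form of the saturating term; the sketch assumes an exponential saturation, d(n) = D (1 - exp(-k n)) + e n, where D is the true number of distinct clones, k a saturation rate, and e the slope caused by read errors. The function names, the parameter names, and the specific (mu + lambda) evolution strategy with a self-adapted, multiplicative mutation are all illustrative assumptions.

```python
import math
import random


def distinct_clones(n_reads, d_true, rate, err_slope):
    # Assumed form: exponential saturation towards d_true plus a linear
    # term for sequences that appear only because of read errors.
    return d_true * (1.0 - math.exp(-rate * n_reads)) + err_slope * n_reads


def mean_squared_error(params, data):
    # data is a list of (number of reads, observed distinct sequences).
    residuals = [distinct_clones(n, *params) - y for n, y in data]
    return sum(r * r for r in residuals) / len(data)


def fit_with_es(data, start, mu=5, lam=30, generations=200, seed=0):
    """Hypothetical (mu + lambda) evolution strategy for the three parameters."""
    rng = random.Random(seed)
    tau = 1.0 / math.sqrt(2.0 * len(start))  # learning rate for the step size
    # Each individual is (error, parameters, step size sigma).
    population = [(mean_squared_error(start, data), tuple(start), 0.3)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            _, parent, sigma = rng.choice(population)
            # Self-adaptation: mutate the step size first, capped at 1.0.
            child_sigma = min(1.0, sigma * math.exp(tau * rng.gauss(0.0, 1.0)))
            # Multiplicative mutation keeps every parameter positive and
            # copes with the very different scales of D, k and e.
            child = tuple(
                p * math.exp(child_sigma * rng.gauss(0.0, 1.0)) for p in parent
            )
            offspring.append((mean_squared_error(child, data), child, child_sigma))
        # Plus selection: parents compete with offspring, so the best
        # error found so far can never get worse.
        population = sorted(population + offspring, key=lambda ind: ind[0])[:mu]
    return population[0]
```

Calling `fit_with_es(data, start)` with a list of (reads, distinct sequences) pairs and a positive starting guess returns the best error, the fitted parameter triple (D, k, e), and the final step size. Under the assumed form, the fitted D is the estimate of the true number of distinct clones, and e describes how fast spurious sequences accumulate.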