Bioinformatics - Understanding FastQC/MultiQC (Timestamps)

  • Added 25 Jul 2024
  • In this video, I thought it would be a good idea to take a step back and start to understand some of the tools that we have used so far - in particular FastQC and its quality control reports. FastQC provides a nice report on several important metrics for .fastq read files. Additionally, MultiQC makes it easier to view these metrics for all of the files within a project and to spot any that behave very differently from the others.
    This video uses the FastQC report created in the previous video on the raw read file SRR2121770_1.fastq.gz. This is from this publication:
    pubmed.ncbi.nlm.nih.gov/26372...
    Make sure to check out • Bioinformatics - SRA D... for how the files were downloaded and the tools installed and run.
    FastQC manual:
    dnacore.missouri.edu/PDF/Fast...
    Timestamps:
    0:01:24 - Basic Statistics
    0:04:28 - Per Base Sequence Quality
    0:08:04 - Per Tile Sequence Quality
    0:11:07 - Per Sequence Quality Scores
    0:12:33 - Per Base Sequence Content
    0:14:43 - Per Sequence GC Content
    0:17:36 - Per Base N Content
    0:19:04 - Sequence Length Distribution
    0:20:37 - Sequence Duplication Levels
    0:23:24 - Over-represented Sequences
    0:25:23 - Adapter Content
    Not a fan of doing plugs for "Like, Share, and Subscribe" but if you could, I would greatly appreciate it. Thanks!
    Project Github:
    github.com/ACSoupir/Bioinform...
    Image at the beginning on the bottom left is modified from AllGenetics.EU.
    Please consider contributing to my Patreon where I may do merch and gather ideas for future content:
    / alexsoupir
  • Howto & Style

Comments • 8

  • @drumpdump1995 2 years ago

    Nice explanation. I found the overrepresented sequences really useful for confirming hashtag or CITE-seq antibody binding and sequencing quality.

  • @gmochales 4 years ago +1

    Thanks for the video! In your view, what would be the expected pattern in sequence duplication levels for ddRADseq? I have two big peaks at >10 and >100. Thanks!

    • @alexsoupir 4 years ago +1

      Hey, Gabriel -
      I am not sure. I actually hadn't heard of ddRADseq until you mentioned it. Something I have read for DNA sequencing is that duplication levels can increase if the genome has a lot of repeats in it, just as in a transcriptome, where highly up-regulated genes will cause duplication warnings. You can try running the rest of your analysis and see whether anything strange comes from the higher duplication levels.
      Another cause might be PCR amplification after the library has been prepped. Depending on how many cycles you run, there is a chance that, after too many, you end up with PCR bias. I'm not sure how your library was prepped, but that could be something to think about, too.
      Sorry for not knowing the exact answer - I'd have to read more about ddRADseq, but it sounds like the dd part makes the selection rather narrow, so some duplication would be expected: the library covers only genomic regions that are flanked by the restriction enzyme sites AND fall within a specific size range, rather than being prepared 100% randomly from genomic DNA. But, like I mentioned, that is just a guess.
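
To make the ">10" and ">100" peaks discussed above concrete, here is a minimal Python sketch of how reads could be grouped into duplication levels. It is an illustration only, not FastQC's own implementation: the function names `level_label` and `duplication_profile` are hypothetical, the bin edges are an assumption loosely based on the labels in the FastQC plot, and every read is counted in full rather than sampled or truncated.

```python
from collections import Counter


def level_label(count):
    """Map a copy number to a coarse duplication-level label.

    Assumption: bins are 1..9 individually, then >10, >50, >100, >500,
    >1000, >5000, >10000, where ">10" here means 10 to 49 copies, etc.
    """
    if count < 10:
        return str(count)
    for threshold in (10000, 5000, 1000, 500, 100, 50, 10):
        if count >= threshold:
            return f">{threshold}"


def duplication_profile(reads):
    """Return the percentage of all reads that fall into each level."""
    copies = Counter(reads)          # sequence -> number of identical reads
    reads_per_level = Counter()
    for n in copies.values():
        # A sequence seen n times contributes n reads to its level.
        reads_per_level[level_label(n)] += n
    total = len(reads)
    return {label: 100 * n / total for label, n in reads_per_level.items()}


if __name__ == "__main__":
    # Toy data: one sequence seen 12 times, one seen 3 times, one singleton.
    toy_reads = ["ACGT"] * 12 + ["GGCC"] * 3 + ["TTAA"]
    for label, pct in sorted(duplication_profile(toy_reads).items()):
        print(f"{label:>6}: {pct:.2f}% of reads")
```

Under these assumptions, a library in which a limited set of loci is sequenced many times over would place most of its reads in the higher bins, which fits the guess in the reply above.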

  • @victorrorisang479 2 years ago

    Can I trim my raw reads twice? Meaning, I trim the raw reads, then take the results and trim them again?

  • @chinhhoang2375 2 years ago

    Is it helpful to remove duplicates before trimming?

  • @arpitabhatt989 9 months ago

    I have done 16S metagenomic sequencing and the sequence duplication percentage is 90%. Is that fine?

    • @alexsoupir 9 months ago

      Without more information, allow me to offer some thoughts:
      1) rRNA sequences are highly conserved, whether bacterial or animal. Depending on what is being explored with such sequences, and the source, it is plausible that the sequences will overlap a great deal.
      2) Remember that duplication can mean a few different things. A unique molecular identifier (UMI) could help narrow down the source of the duplication: is it from a high abundance of a particular bacterial species? Is it from PCR bias? Is it from some unaccounted-for source? It really depends on the parameters from which the data were derived.
      3) Looking at single isolate sequencing results. If the data being explored is from a single source, then the duplication *should* be high. If you are trying to sequence a pure source yet see low duplication, I would be concerned. If you sequenced a short region of your genome (say, an 18S rDNA sequence), there should be high coverage, and the deeper the sequencing, the more the duplication will increase: there are only so many locations for unique molecules to originate from, so with more depth the probability of seeing the same start and end increases, even for different molecules.
      4) Inclusion of adapter sequences not known to the software. If you are working with raw data and custom adapters that FastQC is unaware of, I could imagine it flagging them as duplicated sequences when reads run through into the adapter (a technical issue). This is less likely, both because custom adapters are rare (niche sequencing companies) and because the tech running the machine would have had to allow a sample with short inserts to be sequenced with the wrong kit.
      Summary - duplication is not always unexpected. You have to understand several things to assess whether any duplication is expected or technical. I would reach out to the sequencing service provider to explore the reasoning further.
      acs
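
The depth argument in point 3 can be sketched with a small calculation. This is a simplified model under stated assumptions, not a description of how FastQC measures duplication: reads are assumed to be drawn uniformly and with replacement from a fixed pool of distinct fragments, and the function name `expected_duplicate_fraction` is hypothetical. With `K` distinct fragments and `N` reads, the expected number of distinct fragments observed is `K * (1 - (1 - 1/K)**N)`, and everything beyond that is a repeat.

```python
def expected_duplicate_fraction(n_reads, n_distinct):
    """Expected share of reads that repeat an already-seen fragment.

    Assumes n_reads are drawn uniformly, with replacement, from
    n_distinct possible fragments (real libraries are rarely uniform).
    """
    if n_reads <= 0 or n_distinct <= 0:
        raise ValueError("n_reads and n_distinct must be positive")
    # Expected number of distinct fragments seen at least once.
    expected_unique = n_distinct * (1 - (1 - 1 / n_distinct) ** n_reads)
    return 1 - expected_unique / n_reads


if __name__ == "__main__":
    # Hypothetical amplicon-like pool with only 5,000 distinct fragments.
    for depth in (1_000, 10_000, 100_000, 1_000_000):
        frac = expected_duplicate_fraction(depth, 5_000)
        print(f"{depth:>9} reads: {frac:.1%} expected duplicates")
```

Under this model the duplicate fraction climbs toward 100% as depth grows while the pool of possible fragments stays fixed, which is why a high value for a short, deeply sequenced region need not signal a technical problem on its own.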