Biostatistics NCBI GEO data collection Database creation

  • Added on 22 Aug 2024
  • In this 15-minute video I demonstrate how to get a gene annotation table and a transcriptomic data set from NCBI GEO. Then we make a quick gene list of our 'virtual' biomarkers. I show you how to open them in Excel and save them in a format for import into Access. We then import these three tables to create our database. I show a bit of how to set up a query just at the end, and will complete that and a 'virtual' data analysis in the next video. The second video for this demonstration is titled "Biostatistics database QUERY and analysis in Excel".
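The three-table idea in the description can be sketched outside Access as well. The example below is a minimal illustration using Python's built-in `sqlite3` module rather than Access; the table names (`annotation`, `expression`, `biomarkers`) and column names are hypothetical and are not taken from the video.

```python
import sqlite3


def build_database(annotation_rows, expression_rows, biomarker_symbols):
    """Load three small tables into an in-memory SQLite database.

    The table and column names are illustrative assumptions only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation (probe_id TEXT PRIMARY KEY, gene_symbol TEXT)"
    )
    conn.execute(
        "CREATE TABLE expression (probe_id TEXT, sample_id TEXT, value REAL)"
    )
    conn.execute("CREATE TABLE biomarkers (gene_symbol TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO annotation VALUES (?, ?)", annotation_rows)
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", expression_rows)
    conn.executemany(
        "INSERT INTO biomarkers VALUES (?)", [(s,) for s in biomarker_symbols]
    )
    return conn


def query_biomarker_expression(conn):
    """Return (gene_symbol, sample_id, value) rows for the biomarker genes only."""
    sql = """
        SELECT a.gene_symbol, e.sample_id, e.value
        FROM biomarkers AS b
        JOIN annotation AS a ON a.gene_symbol = b.gene_symbol
        JOIN expression AS e ON e.probe_id = a.probe_id
        ORDER BY a.gene_symbol, e.sample_id
    """
    return conn.execute(sql).fetchall()
```

The query joins the gene list to the annotation table by gene symbol, then to the expression table by probe ID, which is roughly what a select query across three linked tables would do in Access.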

Comments • 23

  • @mirirfan9644 3 years ago

    Very helpful, thank you. I would like to watch the second recording of the same lecture.

  • @sakibsarkerii514 1 year ago

    Hello Mr Clyde,
    I am facing a problem with the GSE14520 dataset.
    Is it possible to apply the mRMR algorithm to this dataset? To apply this algorithm to any dataset, the dataset must have a class attribute. But the GEO dataset doesn't contain any class label. How, then, can I apply the mRMR algorithm to this dataset? Please help me.

  • @madhurimadatta2691 6 years ago

    Hi, can you suggest how to download a dataset available in the GEO database?

  • @ursbiku 10 years ago

    Hi, it's a great video... do you have a second part of this video? If yes, could you please upload it? Thanks.

    • @clydephelix8947 10 years ago

      Hello Biku, sorry for the delay in responding. I have added the title of the second video in this series and the URL. I appreciate the compliment.

  • @Sarah_Ayyad 7 years ago

    Hi Clyde, I would like to ask you another question. I fear it may be a stupid question, but I searched a lot and didn't find a suitable answer.
    I want to ask about the datasets: what do the samples represent? Do they represent different patients, or different experiments on the same patient?

    • @clydephelix8947 7 years ago

      Hello Sarah, again this was just a data set I had generated for my students, a convenient source of numbers. I do not recall exactly which columns I had used from the original NCBI GEO dataset GSE23806, but here is the link if you want to download the txt data file for yourself.
      www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE23806

    • @Sarah_Ayyad 7 years ago

      Sorry, I am asking generally about the datasets used in gene expression microarrays: what do the samples mean?
      Do they represent different patients?

    • @clydephelix8947 7 years ago

      Let me start my response by referring you to the next two videos in this demonstration.
      czcams.com/video/4kbFefje-xk/video.html
      czcams.com/video/1BjUf3c0oWA/video.html
      I believe your answer would be found in the third video in this series. However, again for demonstration purposes in my class, I ignored the actual meaning of the original columns and just wanted to get sets of numbers to compare gene expression statistically. It is important to state that what I show in this series of videos is not the accepted standard for determining differential gene expression from transcriptome data sets. These are just quickly accessible sets of numbers to demonstrate the software and process.
      You will have to look at the GSE23806 data set page to see what each column is, e.g., if you analyzed these columns you would have compared primary tumor over 12 samples out of the original 92 total samples in this data set. On that page, if you click on the GSM#### link you see the information for that particular sample (pasted below also). I hope this helps answer your question. This glioblastoma data set does not actually have a control group to compare values against, like you might have if you compared age-matched control liver to diabetes mellitus liver biopsy samples. Maybe among these 12 samples you would have samples at different WHO grades; the example below is grade WHO IV.
      GSM587218 primary tumor_T-GS-1
      GSM587219 primary tumor_T-GS-2
      GSM587220 primary tumor_T-GS-3
      GSM587221 primary tumor_T-GS-4
      GSM587222 primary tumor_T-GS-5
      GSM587223 primary tumor_T-GS-6
      GSM587224 primary tumor_T-GS-7
      GSM587225 primary tumor_T-GS-8
      GSM587226 primary tumor_T-GS-9
      GSM587227 primary tumor_T-GS-10
      GSM587228 primary tumor_T-GS-11
      GSM587229 primary tumor_T-GS-12
      Sample: GSM587218
      Status: Public on Feb 12, 2011
      Title: primary tumor_T-GS-1
      Sample type: RNA
      Source name: tumor tissue corresponding cell line GS-1
      Organism: Homo sapiens
      Characteristics: sample name: T-GS-1; sample type: glioblastoma (GBM), original tumor; tumor grade: WHO IV
      Growth protocol: conventional glioma cell lines and self established glioma cell lines (ML lines) --> Dulbecco's modified Eagle's medium, supplemented with 10% fetal calf serum, 2 mM L-glutamine, and 1 mM sodium pyruvate / glioblastoma stem-like cell lines (GS lines) --> neurobasal medium with B27 supplement (20 μl ml−1), Glutamax (10 μl ml−1), fibroblast growth factor-2 (20 ng ml−1), epidermal growth factor (20 ng ml−1) and heparin (32 IE ml−1)
      Extracted molecule: total RNA
      Extraction protocol: Total RNA was extracted from cells using the RNeasy Protect Mini Kit (Qiagen). Genomic DNA contamination was removed through an on-column DNase digestion step.

    • @Sarah_Ayyad 7 years ago

      I got it, I am thankful for your help

  • @Sarah_Ayyad 7 years ago

    Hi Clyde, I am new to this field.
    I want to ask about the numbers in the matrix: what do they represent?
    I know that columns represent samples and rows represent genes ('features'), but what do the numbers express?

    • @clydephelix8947 7 years ago +1

      Hello Sarah,
      I would call those values the gene expression levels (normalized for comparison). This may not be accepted as 'a best method', but it did give numbers that served the purpose for my Biostatistics class; that is, the purpose of this video was to instruct my students on one of their assignments.
      In this case I have used globalization to normalize the gene expression values that were in the original text files downloaded from NCBI GEO. This link, found by a Google search on 'globalization of microarray data', explains this normalization method and the rationale for using it: books.google.com/books?id=RFLmBwAAQBAJ&pg=PA158&lpg=PA158&dq=globalization+of+microarray+data&source=bl&ots=e_CxU91Hwd&sig=dxYZaCUFVObCWvd-3yIVpERocC8&hl=en&sa=X&ved=0ahUKEwipy9XByarUAhVIxFQKHdGKAWEQ6AEIOTAD#v=onepage&q=globalization%20of%20microarray%20data&f=false
      I appreciate your interest in my video.
      Clyde
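The reply above names globalization as the normalization step but does not give the exact calculation used for the class. As a rough sketch only, assuming globalization here means rescaling every sample (column) so that all samples share the same mean signal, it could look like this. The function name `globalize` and the `target_mean` parameter are hypothetical.

```python
def globalize(matrix, target_mean=1.0):
    """Rescale each sample (column) so that its mean equals target_mean.

    matrix is a list of rows (one per gene), each row a list of numeric
    values (one per sample). This is an assumed form of global
    normalization, not necessarily the one used in the video.
    """
    if not matrix:
        return []
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Mean signal of each sample across all genes.
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    if any(m == 0 for m in col_means):
        raise ValueError("a sample with zero mean signal cannot be rescaled")
    return [
        [row[j] * target_mean / col_means[j] for j in range(n_cols)]
        for row in matrix
    ]
```

After this step a value of 1.5 means "1.5 times the average signal of that sample", which makes values comparable between samples that differ in overall intensity.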

    • @Sarah_Ayyad 7 years ago

      Thank you :)

  • @zeyadnassar3194 8 years ago

    Hi Clyde, there are some huge datasets, around 12 gigabytes in size. What is the best way to open them?

    • @clydephelix8947 8 years ago

      Hello Zeyad. You might be working with NGS RNAseq files if they are that big. I had posted a video (link below) on that challenge before - be sure to look at the very helpful comments contributed by others.
      I had just recently run into a problem with MS Excel in that a gene SNP table had too many rows for one spreadsheet. With that in mind, MS Access might have similar limitations on row number. If the files you are talking about just have a very large number of columns and not too many rows, then the problem is with the power of your personal computer (assuming you have not exceeded the column limit of the software). You can try R for these large files too, but again the power of your computer will be the limitation.
      This reminds me of the 1990s when I would sit at my computer for hours on end waiting for a download to finish and then begin the slow computer process to unzip or install and work with the downloads....
      Six core and eight core (gaming) computers are available and not too costly.
      Using NCBI SRA Toolkit to convert to FASTQ
      czcams.com/video/gKvONx0_lww/video.html
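For tables with more rows than a spreadsheet can hold (a current Excel worksheet stops at 1,048,576 rows), one alternative to the tools mentioned above is to stream the file line by line and keep only the rows of interest, so the whole file never has to fit in memory. The sketch below assumes a tab-delimited text file with a header row and an ID in the first column; the function name `filter_large_table` is hypothetical.

```python
import csv


def filter_large_table(path, wanted_ids, id_column=0, delimiter="\t"):
    """Stream a delimited text file and keep only rows whose ID is wanted.

    Returns (header, kept_rows, total_data_rows). Only the kept rows are
    held in memory, so the file itself can be far larger than RAM.
    """
    kept = []
    total = 0
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)  # first line is assumed to be the header
        for row in reader:
            total += 1
            if row and row[id_column] in wanted_ids:
                kept.append(row)
    return header, kept, total
```

Passing `wanted_ids` as a set keeps the membership check fast even for long gene lists. This approach suits delimited text tables; raw sequencing files would still need the SRA Toolkit route described above.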

  • @nastaranmarzban1419 2 years ago

    Hi, hope you're doing well.
    My major is statistics and unfortunately I know almost nothing about genetics...
    But I have to do some analyses on genetic data.
    I have two questions; would you please help me?
    1) We have 100 people, and for each of them we measure their genotypes.
    I want to know whether these genes have features.
    To explain what I mean by features: each atom, say oxygen, has a specific number of protons, electrons and so on, but different chemicals that contain oxygen have different numbers of oxygen atoms.
    So a feature means something that is fixed.
    Now I want to know: do we have features for each gene? (Something that is fixed and does not change between people, whereas genotypes differ between people.)
    2) If the answer to the above question is yes (we have genotypes for each person, which change from one person to another, and we have features for each gene), how can I get data like this?
    I'll be very thankful if you help me...
    Thanks in advance

    • @clydephelix8947 2 years ago

      I suggest some study of single nucleotide polymorphisms (SNPs). An individual's genome is expected to be unchanged throughout life except in certain limited cell types, for example tumors and antibody-producing cells. Interindividual variance of genome sequence is a focus of your work, it seems. Different SNPs vary in their influence depending on whether they are synonymous or nonsynonymous. If you are looking for a gene sequence that is never different across all individuals, you will need to seek help from a genetics expert. Try the ResearchGate discussion boards.

  • @tharunreddy7563 7 years ago

    Sir, I can't access the matrix tables. It shows that the page is not working when I try to download them.

    • @clydephelix8947 7 years ago

      Hello Tharun. Presumably you are referring to the Series Matrix file. Sometimes NCBI GEO has a glitch. Try again after some delay or with another Data Set that does have the Series Matrix file available. Let me know how it works.
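Once a Series Matrix file does download, its layout is simple: metadata lines start with `!`, and the expression table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers, tab-delimited with quoted text fields. A minimal parsing sketch under those assumptions follows; the function name `parse_series_matrix` is hypothetical, and all values are left as strings.

```python
def parse_series_matrix(lines):
    """Split GEO Series Matrix text into metadata, table header and table rows.

    lines is any iterable of text lines, e.g. an open file handle.
    Returns (metadata, header, rows), where metadata maps each key
    (without the leading "!") to a list of its values.
    """
    metadata = {}
    header = None
    rows = []
    in_table = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank separator lines
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
            continue
        if line.startswith("!series_matrix_table_end"):
            in_table = False
            continue
        if in_table:
            fields = [field.strip('"') for field in line.split("\t")]
            if header is None:
                header = fields  # first table line holds ID_REF and GSM IDs
            else:
                rows.append(fields)
        elif line.startswith("!"):
            key, _, rest = line[1:].partition("\t")
            values = [v.strip('"') for v in rest.split("\t")] if rest else []
            metadata.setdefault(key, []).extend(values)
    return metadata, header, rows
```

The files are distributed gzip-compressed, so a handle from `gzip.open(path, "rt")` can be passed in directly without unpacking the file first.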

  • @awaisawan4462 7 years ago

    My advisor has given me data for analysis. Can you help me analyze and understand it?

    • @clydephelix8947 7 years ago

      I can certainly try. What type of data do you have to analyze: genomics, transcriptomics, proteomics, metabolomics...? What school do you attend?

    • @awaisawan4462 7 years ago

      Can I have your email ID?

    • @clydephelix8947 7 years ago

      Search my name on the internet to find my academic affiliation and use that email address. Look at my publications on Google Scholar.