R tutorial - Using Factors in R

  • Added 6 July 2024
  • In this introduction to R course you will learn about the basics of R, as well as the most common data structures it uses to store data.
    Join DataCamp today, and start our interactive intro to R programming tutorial for free: www.datacamp.com/courses/free...
    If you have some background in statistics, you'll have heard about categorical variables. Unlike numerical variables, categorical variables can only take on a limited number of different values. Otherwise put, a categorical variable can only belong to a limited number of categories. As R is a statistical programming language, it's not a surprise that there exists a specific data structure for this: factors. If you store categorical data as factors, you can rest assured that all the statistical modelling techniques will handle such data correctly.
    A good example of a categorical variable is a person's blood type: it can be A, B, AB or O. Suppose we have asked 8 people what their blood type is and recorded the information as a vector `blood`.
    Now, for R it is not yet clear that you're dealing with categorical variables, or factors, here. To convert this vector to a factor, you can use the `factor()` function.
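A minimal sketch of this step. The eight answers below are made up for illustration, since the recorded values are not given in the text; only `blood` and `factor()` come from the description, and `blood_factor` is the name used later on.

```r
# Hypothetical answers from 8 people (assumed values, not taken from the video)
blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")

# A plain character vector: printed with double quotes, no levels
print(blood)

# Convert it to a factor: printed without quotes, followed by a "Levels:" line
blood_factor <- factor(blood)
print(blood_factor)

# The levels R inferred, sorted alphabetically
levels(blood_factor)
```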
    The printout looks somewhat different than the original one: there are no double quotes anymore and also the factor levels, corresponding to the different categories, are printed. R basically does two things when you call the factor function on a character vector: first of all, it scans through the vector to see the different categories that are in there. In this case, that's "A", "AB", "B" and "O". Notice here that R sorts the levels alphabetically. Next, it converts the character vector, blood in this example, to a vector of integer values. These integers correspond to a set of character values to use when the factor is displayed. Inspecting the structure reveals this:
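The structure can be inspected with `str()`. This is a sketch using a made-up `blood` vector (the values are an assumption), so the integer codes shown apply to that vector only.

```r
# Hypothetical answers from 8 people (assumed values)
blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")
blood_factor <- factor(blood)

# Compact display of the levels and the underlying integer codes
str(blood_factor)
# Factor w/ 4 levels "A","AB","B","O": 3 2 4 1 4 4 1 3
```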
    We're dealing with a factor with 4 levels. The "A"'s are encoded as 1, because it's the first level, "AB" is encoded as 2, "B" as 3 and "O" as 4. Why this conversion? Well, it can be that your categories are very long character strings. Each time repeating this string per observation can take up a lot of memory. By using this simple encoding, much less space is necessary. Just remember that factors are actually integer vectors, where each integer corresponds to a category, or a level.
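To make the encoding concrete, the sketch below pulls out the integer codes and uses them to look the labels up again. The `blood` values are made up for illustration, and `codes` and `decoded` are hypothetical helper names.

```r
# Hypothetical answers from 8 people (assumed values)
blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")
blood_factor <- factor(blood)

# A factor is stored as an integer vector...
typeof(blood_factor)              # "integer"
codes <- as.integer(blood_factor)
print(codes)

# ...and each integer indexes into the vector of levels
decoded <- levels(blood_factor)[codes]
identical(decoded, blood)         # TRUE
```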
    As I said before, R automatically infers the factor levels from the vector you pass it and orders them alphabetically. If you want a different order in the levels, you can specify the levels argument in the factor function.
    If you compare the structures of `blood_factor` and `blood_factor2`, you'll see that the encoding is different now.
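A sketch of that comparison. The `blood` values and the particular level order passed to `levels` are assumptions chosen for illustration; any permutation of the four categories would work the same way.

```r
# Hypothetical answers from 8 people (assumed values)
blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")

# Default: levels inferred and sorted alphabetically
blood_factor <- factor(blood)

# Explicit level order (this particular order is an assumption)
blood_factor2 <- factor(blood, levels = c("O", "A", "B", "AB"))

str(blood_factor)
# Factor w/ 4 levels "A","AB","B","O": 3 2 4 1 4 4 1 3
str(blood_factor2)
# Factor w/ 4 levels "O","A","B","AB": 3 4 1 2 1 1 2 3
```

The displayed categories are identical in both factors; only the integer codes behind them change.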
    Besides changing the order of the levels, it is possible to manually specify the level names, instead of letting R choose them. Suppose that for clarity, you want to display the blood types as `BT_A`, `BT_AB`, `BT_B` and `BT_O`. To rename the levels of an existing factor, you can use the `levels()` function. Similar to using the `names()` function to name vectors, you can assign a vector of new names to `levels(blood_factor)`.
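A minimal sketch of renaming levels after the factor has been created, using a made-up `blood` vector (the values are an assumption). The new names must be given in the order of the existing levels.

```r
# Hypothetical answers from 8 people (assumed values)
blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")
blood_factor <- factor(blood)

# Existing levels are "A", "AB", "B", "O"; the new names must follow that order
levels(blood_factor) <- c("BT_A", "BT_AB", "BT_B", "BT_O")
print(blood_factor)
```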
    You can also specify the category names, or levels, by specifying the `labels` argument in `factor()`.
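The same renaming can be sketched with the `labels` argument at creation time. The `blood` values are made up, and `blood_factor_labelled` is a hypothetical name used only here.

```r
# Hypothetical answers from 8 people (assumed values)
blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")

# labels are matched to the alphabetically sorted levels: A, AB, B, O
blood_factor_labelled <- factor(blood,
                                labels = c("BT_A", "BT_AB", "BT_B", "BT_O"))
print(blood_factor_labelled)
```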
    I admit it, it's a bit confusing. For both of these approaches, it's important to follow the same order as the order of the factor levels: first A, then AB, then B and then O. But this can be pretty dangerous: you might have mistakenly changed the order.
    To solve this, you can use a combination of manually specifying the `levels` and the `labels` argument when creating a factor. With `levels`, you specify the order, just like before, while with the labels, you specify a new name for the categories:
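A sketch of the combined approach, where each label sits directly alongside the level it renames. The `blood` values, the chosen level order and the name `blood_factor3` are assumptions for illustration.

```r
# Hypothetical answers from 8 people (assumed values)
blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")

# levels fixes the order; labels renames each level position by position
blood_factor3 <- factor(blood,
                        levels = c("O", "A", "B", "AB"),
                        labels = c("BT_O", "BT_A", "BT_B", "BT_AB"))
print(blood_factor3)
```

Because `levels` and `labels` are written out in the same order, a mismatch between a category and its new name is much easier to spot than when relying on the implicit alphabetical order.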
    In the world of categorical variables, there's also a difference between nominal categorical variables and ordinal categorical variables. A nominal categorical variable has no implied order. For example, you can't really say that the blood type "O" is greater or less than the blood type "A". "O" is not worth more than "A" in any sense I can think of. Trying such a comparison with factors will generate a warning, telling you that less than is not meaningful:
    However, there are examples for which such a natural ordering does exist. Consider for example this `tshirt` vector. It has codes ranging from small to large. Here, you could say that extra large indeed is greater than, say, a small, right?
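A sketch of such a comparison on an unordered factor. The `blood` values are made up; the handler around the comparison is only there to show the warning text explicitly and is not needed in interactive use.

```r
# Hypothetical answers from 8 people (assumed values)
blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")
blood_factor <- factor(blood)

# Comparing two elements of an unordered factor warns and yields NA
result <- withCallingHandlers(
  blood_factor[1] < blood_factor[2],
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(result)  # NA
```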
    Of course, R provides a way to impose this kind of order on a factor, thus making it an ordered factor. Inside the `factor()` function, you simply set the argument `ordered` to `TRUE`, and specify the levels in ascending order.
    Can you see how these less-than signs appear between the different factor levels? This compactly shows that we're dealing with an ordered factor now. If we now try to perform a comparison, this call for example, ..., evaluates to TRUE, without a warning message, because a medium was specified to be less than a large.
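A minimal sketch of creating an ordered factor. The size codes in `tshirt` and the name `tshirt_factor` are assumptions, since the actual vector is not reproduced in the text.

```r
# Hypothetical t-shirt sizes (assumed values)
tshirt <- c("M", "L", "S", "S", "L", "M", "L", "M")

# ordered = TRUE, with the levels listed from smallest to largest
tshirt_factor <- factor(tshirt, ordered = TRUE, levels = c("S", "M", "L"))

# The "Levels:" line of the printout now reads S < M < L
print(tshirt_factor)
```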
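One way such a comparison might look, sketched with a made-up `tshirt` vector whose first element is a medium and whose second is a large. The vector, the name `tshirt_factor` and the specific elements compared are all assumptions, not the call from the video.

```r
# Hypothetical t-shirt sizes (assumed values)
tshirt <- c("M", "L", "S", "S", "L", "M", "L", "M")
tshirt_factor <- factor(tshirt, ordered = TRUE, levels = c("S", "M", "L"))

# "M" comes before "L" in the specified level order, so this is TRUE
is_smaller <- tshirt_factor[1] < tshirt_factor[2]
print(is_smaller)  # TRUE
```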

Comments • 47

  • @buttman20 7 years ago +38

    This is a super concise and well organized video on Factors.

  • @eldarion100 6 years ago +1

    Excellent video. I am taking a class that does not go into much detail, and this video is the best supplement I could have asked for on this topic.

  • @coopernfsps 7 years ago +12

    What an absolutely fantastic video, high production quality, great clarity. Thank you very much!

    • @WithASideOfFries 7 years ago +3

      I completely agree. I've been troubleshooting a specific R problem for over an hour with absolutely no luck until I found this brilliant, straightforward video (that was also uploaded on my birthday!).

  • @adiksaff 4 years ago +3

    Very VERY helpful! I somehow did not get this the first time our teacher explained it. Thank you so much!

  • @I2ezident 4 years ago +1

    Great little video, I don't have the time for the long ones - thank you!

  • @benparish6772 4 years ago

    Now I finally get factors. So clearly explained.

  • @tania409g 5 years ago

    I have taken the DataCamp website courses- the practice module is great, I dont have to juggle between windows on my laptop to exercise the skills taught by Filip. He is the best -for Python & R. I got so used to his way of teaching that it was difficult to see another person's method in the same DataCamp website.

  • @AbhishekGupta-wx2ib 3 years ago

    Very concise & clear, absolutely loved it!!

  • @SaintMartini84 3 years ago

    Thank you! The online class I was taking was getting factors after using just str() and I was not although we were using the same .csv file and then they just continued on so it was driving me insane. You filled in the blanks nice and quickly so at least now I know how to convert when necessary and understand more of what can be done with factors.

  • @nwoodw 2 years ago

    This was incredibly helpful and well done. Thank you.

  • @KreshnikMorina 4 years ago

    Such a good explanation! Thanks!

  • @june8390 5 years ago +1

    wow amazing👍 I cannot grasp ideas of factor function, But after seeing this video, it automatically solved. I'll Subscribe this.

  • @Cynical_Engineer 1 year ago

    This really helped me a lot, thank you.

  • @yellowmellow4753 5 years ago

    Excellent, thank you.

  • @TheRspeeed 1 year ago

    Very very helpful, thank you!

  • @darrenmalibiran7024 4 years ago

    Do you have a video wherein you manipulate these factors with forcats package?

  • @BDQUERY350 6 years ago

    Great Job friend !!!!

  • @RobinBeaumont 3 years ago

    Great and so clear - what about mentioning dplyr at the end

  • @aVersCloudSolution 7 years ago

    Thank you. This tutorial is awesome! Exactly what I needed!

  • @jacheto 7 years ago +1

    amazing video thank you

  • @KennTollens 6 years ago

    Best tutorial.

  • @matthewsattam1982 7 years ago

    Great job

  • @viswajeettoshniwal6541

    Thanks a lot sir for uploading this video for helping students

  • @cliderninocespedes7215

    Great video!

  • @Rsingh1 3 years ago +1

    Perfect

  • @muhammadhamzahm1204 5 years ago

    I'm having an error in predicting. My code line is 'predict(rf, testData)'. Error is "New factor levels not present in the training data"

  • @msparisa 4 years ago

    excellent, Thank you

  • @rajeswarichowdary8894 5 years ago

    Nice thank you so much yar👍

  • @michealjennifer3530 3 years ago

    thank you so much Sir~~

  • @christianrodier3381 3 years ago

    That was helpful thanks

  • @shobana0111 2 years ago

    Thank you

  • @NeerajKumar-tb3ek 5 years ago

    really good video

  • @MariiaHovorukha 4 years ago

    cool, man!

  • @said2614 3 years ago

    May Allah be pleased with you

  • @mubeenshahid243 3 years ago

    good bro

  • @moffig1 5 years ago

    The only thing I think would improve your otherwise very well done videos is if you added subtitles to everything he says. Or the option for it anyway.
    The automatically youtube-generated subtitles have a lot of mistakes and are not sometimes not correct

  • @vascoambrosio7798 10 months ago

    I thought there would be examples for us to practice

  • @sangitaachary3273 6 years ago

    hello sir, how can i convert a factor variable to be an simple vector...
    like,
    [1] 1 2 3 1 2 2 1 2 1 1 3
    Levels: 1 2 3
    Ans:
    [1] 1 2 3 1 2 2 1 2 1 1 3

  • @ananyamalu1408 8 months ago

    what is bta even

  • @mampipandit2494 4 years ago +1

    Mithun Paul night

  • @distracted900 4 years ago +1

    "O is not worth more than A"
    Bruh

  • @ananyamalu1408 8 months ago

    what the fuck i cant understand anything

  • @ClickyKitsune 6 years ago +3

    Sorry mate but your accent is not clear.. That makes it more confusing.. no offense...

  • @hayderfiras4287 1 year ago

    Hello please can you assist me I used smp.l$prof