
mpipks transcript 11. Data Visualization | 阿掖山: a blog

source link: https://mountaye.github.io/blog/articles/mpipks-non-equilibrium-physics-transcript-11-ggplot

Review of last lecture

00:01 Just to let you know that the lecture
00:03 is actually being recorded,
00:05 so you will be visible on the
00:07 recorded lecture at the end,
00:09 if that doesn't bother you.
00:13 Just to let you know. Okay, great.
00:16 So, hello everyone, back to our lecture.
00:21 Last time we had a special lecture, a
00:24 guest lecture,
00:25 by one of our local data scientists,
00:29 and that was Fabian.
00:33 Fabian explained to us, from a
00:36 hands-on perspective,
00:38 because that's his job, how
00:41 you can detect order in
00:45 non-equilibrium systems, and the
00:47 non-equilibrium systems that Fabian is
00:49 working on are of course
00:50 biological systems. What he
00:54 basically showed is how biology, how
00:57 order
00:58 in non-equilibrium systems manifests
01:00 itself
01:01 in low-dimensional structures in these
01:04 high-dimensional data sets. He
01:05 showed you some
01:07 methods, which will also appear on the
01:10 website (I didn't get to uploading the
01:12 slides and the video yet),
01:13 so he showed you some methods of
01:17 how to
01:18 reduce dimensionality, or how to extract
01:21 hidden dimensions,
01:24 in these high-dimensional data sets.

introduction & slide 1

01:27 So today I will start by giving you a
01:30 little
01:31 more introduction to data science and
01:35 some of the things that we need for the
01:36 next lecture; that's the first part of
01:38 the lecture.
01:40 And in the second part of the lecture I
01:43 will
01:43 give you another hands-on experience, a
01:46 practical
01:47 example, from start to finish, of how
01:50 to
01:51 go through such a data science pipeline.
01:55 Now, at the beginning of
01:58 the lecture
01:59 we'll go back to our New York City
02:01 flights
02:02 data set. There's a little gap, because
02:05 we had to
02:06 find a date with Fabian last time, so
02:08 there's a little
02:09 gap: this lecture connects to
02:11 what I told you
02:12 two lectures ago. And,
02:16 let me just share the slide, I'll
02:18 give you a brief
02:20 introduction
02:21 to data visualization,
02:24 just a short one, because the slides have
02:26 already been
02:28 on the website for two weeks, so maybe
02:31 some of you have already looked at them.
02:33 let me just share the screen
02:37 there we go
02:40 Okay, great, perfect. So I'll give you a
02:44 quick
02:44 introduction before we go on to a hands-on
02:46 example.
02:48 so today we’ll have a hands-on data
02:50 science example and next week we’ll have
02:52 a hands-on
02:53 field theory combined with data science
02:56 example now that will be
02:57 next week um so today i’ll still need to
03:01 introduce you to some methods that we
03:03 will need
03:04 and um if you
03:08 can confirm now that you can see my
03:11 slides
03:12 can somebody confirm is that working
03:15 yes okay perfect yeah so i’ll just give
03:18 you a quick introduction uh to
03:20 how to visualize uh data
03:24 And of course, in your work,
03:25 you're all visualizing data all the time,
03:27 but if your data is high-dimensional
03:29 and complex in structure,
03:33 it actually matters what you use for
03:36 visualization.
03:38 Yeah, and so,
03:41 two lectures ago, when
03:43 we talked about this New York City
03:45 data set about the flights that were
03:47 departing from the New York City
03:49 airports, I showed you all kinds of ways
03:52 of how to
03:53 do very efficient computations
03:56 on these data sets. But all these
03:59 computations didn't
04:00 really give us any real insight,
04:04 and the reason was that we were dealing
04:05 with plain numbers but we
04:07 never had anything to look at. And
04:11 today, in the first part of this lecture,
04:13 I'll quickly
04:15 show you a plotting scheme, so to say,
04:18 that is very powerful
04:20 in visualizing data in general, and in visualizing
04:23 high-dimensional data.

slide 2

04:27 Before that, just a quick reminder: before we do
04:29 anything, we always want to make our data
04:33 set
04:34 tidy. We typically
04:37 collaborate with experimentalists,
04:39 and
04:40 we obtain the data in a very messy
04:42 format, and then the first step is to
04:45 tidy the data. That means that we
04:47 need to bring it into a form where
04:49 every column is an observable or
04:52 variable
04:53 and every row is an
04:56 observation or a sample. "Sorry to
04:59 interrupt you, did you change your
05:02 first slide?" Yes. Oh, you can't see that?
05:05 "No, we cannot." I don't know; this
05:06 always happens now in Zoom,
05:08 there was a Zoom update.
05:12 There's something... maybe I have to share
05:15 the entire screen.
05:17 Let's try this. The problem is just
05:21 that I have like
05:21 100 windows on my desktop.
05:25 This happens
05:30 lately all the time, right, that Zoom
05:33 is not working properly. Okay, let's share
05:37 the desktop.
05:42 Okay, hopefully I don't have embarrassing
05:45 stuff on the
05:46 desktop... no, that's not the case. Okay, so
05:49 you should be seeing my
05:52 desktop now.
05:58 Oh, okay, okay, a lot of,
06:01 a lot of windows, okay.
06:05 Now you know what I've been
06:07 working on, okay.
06:08 So,
06:12 can you see this messy-and-tidy slide,
06:17 and when I change the slides now, can you
06:19 see that?
06:20 "Right now it works." Okay, perfect.
06:24 So, this is
06:27 just a reminder that the first
06:31 step that we always do is to make the
06:33 data tidy. If we have the
06:37 data in this tidy format, then we can perform
06:39 column-wise
06:40 operations that are, in most
06:44 programming languages,
06:45 highly optimized and very efficient
06:49 to run and to program.
06:53 I introduced you to this very simple
06:58 R package, data.table, that allows you to implement
07:01 all of these operations in data science,
07:05 and there are many other packages of
07:06 course, in other languages.

slide 3

07:09 Typically, such an operation
07:13 consists of three steps. You
07:16 filter the data (in this data.table
07:18 package, that's the first
07:20 parameter), then you group
07:24 the rows by some condition
07:27 (that's the third parameter), and then you perform
07:29 an operation
07:31 independently on each group
07:33 (that's the one in the middle).
07:36 So this is a typical step in such a data
07:39 science
07:40 calculation, and I showed you how you
07:42 can then combine these steps into
07:44 pipelines
07:46 to perform more complex operations.
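As a reminder, here is a minimal sketch of that three-step pattern in data.table syntax, DT[filter, operation, grouping], using a small made-up flights table (the column names mirror the lecture's data set, but the values are illustrative):

```r
library(data.table)

# hypothetical miniature of the flights table
flights <- data.table(origin    = c("JFK", "JFK", "EWR", "LGA"),
                      month     = c(1, 1, 2, 2),
                      dep_delay = c(10, -2, 25, 7))

flights[dep_delay > 0,                    # 1st argument: filter the rows
        .(mean_delay = mean(dep_delay)),  # 2nd argument: operation, run per group
        by = .(origin, month)]            # 3rd argument: grouping condition
```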

slide 4

07:50 Today I want to show you a way of how to
07:53 interact more intuitively
07:55 with the data. You are all
07:59 familiar with plotting, of course,
08:01 and the way we typically do that
08:03 is we have a plot in mind, and this
08:06 plot then has a name:
08:08 that's a scatter plot, a surface plot,
08:11 a histogram. And then you look for the
08:13 function,
08:15 in matplotlib or MATLAB or so,
08:17 that gives you this specific
08:19 kind of plot.
08:21 Another way to do that, and that was
08:25 introduced by someone called Leland
08:27 Wilkinson
08:28 in a book, is that you don't give the
08:31 plots names,
08:33 but you construct these
08:35 plots with a grammar.
08:37 That means that you have a set of rules
08:40 that allows you to construct,
08:42 step by step, almost any visualization of
08:45 your data.
08:47 And once you have that, you don't have to
08:49 remember long names of different kinds
08:51 of plots;
08:52 you just add bit by bit, like in a
08:54 sentence where you add
08:56 word by word to make the sentence
09:00 richer in information. The only
09:02 thing that you need to know
09:04 is the grammar itself, and this allows
09:06 you to
09:07 create, from this simple grammar, very
09:09 different kinds of visualizations.
09:12 Now, this idea that you have a
09:14 grammar of graphics,
09:16 a set of rules that allows you to construct
09:19 visualizations, is
09:22 implemented in R in the ggplot2 package,
09:25 which is quite famous,
09:29 and in Python, a little bit
09:30 newer (I don't know how well it works),
09:33 in the plotnine
09:34 package, which realizes the same idea.
09:39 In R we just load this with
09:42 the library(ggplot2) command, and then we are
09:45 able to use all of the commands

slide 5

09:47 in this package. So the basic idea is
09:52 that we start with a tidy
09:56 data table, or data frame in R,
09:59 and we take this and then we
10:03 assign different
10:06 visual characteristics of our plot
10:10 to different columns of this data table.
10:15 So the first thing we have to do
10:21 is we have to say
10:22 what to plot: a point or a line,
10:25 and that's called the geometry.
10:28 A point, line, bar, circle, whatever:
10:31 that's the geometry.
10:33 Then we have this mapping that I
10:34 mentioned, where we map
10:36 different aesthetic properties of our
10:40 plot
10:41 to different columns in this table here.
10:45 For example, we could say that the
10:47 position on the x coordinate
10:50 should be what is in column A,
10:53 the y coordinate should be what is in
10:55 column B,
10:56 the size of our dots, or whatever, should
10:59 reflect whatever is
11:00 in column C, and the color
11:07 should be what is in column D.
11:12 How these values are then translated
11:16 to specific colors or to specific sizes
11:19 or so
11:20 is a different question. Now we have the
11:24 aesthetic properties of our plots, of our
11:26 dots: where they are located,
11:28 what they look like. And then we just
11:30 have to define a coordinate
11:32 system to define where they appear
11:36 on the screen. And if you have these
11:39 things together,
11:40 we have the simplest version of a plot,
11:44 on the right-hand side.

slide 6

11:48 So the way this works in practice is
11:51 that you
11:52 have these little building blocks
11:55 that you just put together,
11:58 line by line. So
12:01 in R, this looks like this: you first
12:04 create an object, and this
12:07 is just this ggplot command,
12:10 where you tell the plot what data to use
12:13 as the first argument,
12:15 and the second argument is then how to
12:18 map
12:18 different columns of your
12:21 table to different visual properties
12:25 of your plot. Then you add a geometry,
12:30 for example point or line, and you have
12:32 your first plot already.
12:35 You can of course also add more
12:37 properties: you can add more
12:39 geometries, or you can add more
12:41 detailed properties of your plot. For
12:44 example,
12:45 if you are not happy with Cartesian
12:48 coordinates, then you can set your own
12:50 coordinate system;
12:52 you can have polar coordinates or
12:53 whatever.
12:55 You can have subplots by
12:58 adding a further rule here, which is
13:01 called facets.
13:03 You can change how values
13:06 in this data table map
13:10 to different properties, so which color
13:12 represents
13:13 which value in your table. You can change
13:17 themes, for example, or the
13:20 way
13:20 lines and so on are plotted, and you
13:23 can of course also save your file.
13:25 These different building blocks of
13:27 your plots
13:28 are connected via these plus signs: you
13:31 put
13:32 just as many as you want of these
13:34 aspects
13:36 after each other, and by this you
13:38 construct more and more complex
13:40 plots. And for
13:44 everything that you see here below, there
13:45 are some sensible defaults, for example
13:49 Cartesian coordinates, that in many cases
13:52 you don't need to touch.
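Put together, the building blocks look roughly like this (a minimal sketch; `df` and its columns are invented for illustration):

```r
library(ggplot2)

# hypothetical tidy table: columns a, b, d
df <- data.frame(a = 1:10, b = rnorm(10), d = runif(10))

ggplot(df, aes(x = a, y = b, size = d)) +  # data + aesthetic mapping
  geom_point() +                           # geometry: points
  coord_cartesian() +                      # the (default) coordinate system
  theme_minimal()                          # optional: a different theme

# ggsave("my_plot.pdf")                    # save the last plot if desired
```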
13:54 Now let's have a little look at our New
13:56 York City data set.
13:59 And this New York City data set, right,
14:02 here we go.

slide 7

14:02 Now, we already discussed that
14:06 two weeks ago. In this New York City
14:08 data set we have
14:09 information about flights departing from
14:13 New York City airports. For each flight
14:16 we have
14:17 different information: we have, for
14:19 example, the time
14:21 when this flight departed, we have the
14:24 origin
14:24 airport, we have the number of the plane
14:26 that was used,
14:28 the carrier, and so on, and the delay
14:31 for this specific flight. And, as
14:35 we already discussed,
14:36 you can connect this table, which
14:40 you can download from our GitHub,
14:44 from the website; we can
14:47 connect
14:48 these flights to other sources of
14:50 information, for example
14:52 the weather, that is, the weather
14:55 information for a given
14:57 point in time and a given location.
15:00 We can also connect that to airport
15:02 information,
15:04 we can connect that to information
15:06 about the planes,
15:07 and we can also get information
15:10 about the airlines if we want.
15:12 And that's what we do: we
15:15 load
15:16 again, just like last time,
15:20 all of these different files using the
15:23 function
15:23 fread, and then we merge them together
15:27 using these merge commands. When we
15:30 merge them together we sometimes have to
15:32 specify, for example when we merge
15:35 with the
15:36 airports,
15:38 that
15:43 here the airport identifier is in the
15:46 column origin
15:48 and there the airport identifier is in
15:51 the column
15:52 faa. So we merge all of these things
15:56 together and we have a huge data set,
15:58 a data table containing all of this
16:01 information, line by line and
16:04 in a tidy format. We already did that two
16:08 weeks ago.
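A sketch of that loading-and-merging step, assuming CSV exports of the nycflights13 tables (the file names here are illustrative; the column names origin, faa, and tailnum are the ones from the slide):

```r
library(data.table)

flights  <- fread("flights.csv")
weather  <- fread("weather.csv")
airports <- fread("airports.csv")
planes   <- fread("planes.csv")

# weather shares the origin and time columns with flights
flights <- merge(flights, weather,
                 by = c("origin", "year", "month", "day", "hour"))
# airports use "faa" as their identifier, flights use "origin"
flights <- merge(flights, airports, by.x = "origin", by.y = "faa")
# planes are identified by tail number (merge() suffixes any clashing columns)
flights <- merge(flights, planes, by = "tailnum")
```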
16:10 Now let's have a simple look at these
16:13 plots; let's make a
16:15 simple plot.

slide 8

16:16 The first thing we can do,
16:19 just like last time, is calculate
16:22 the average delay
16:27 for each month. So we group the data by
16:29 month,
16:31 and for each month we take the average
16:35 over all departure delays and save that
16:39 average
16:39 in the column mean delay.
16:42 So what we now get is that for
16:44 each month,
16:46 here in the first column, we get a mean
16:49 departure delay, in the second
16:51 column.
16:52 And below you can see a simple
16:54 plot you can do: you can
16:56 tell this ggplot function to take this
17:00 table,
17:01 a very simple table, and map
17:04 the month to the x-axis and the delay to
17:08 the y-axis,
17:10 and then you just add a geometry,
17:13 which is just a point. Then you get what
17:16 you see on the right-hand side,
17:18 and you see that something is happening
17:20 here in the summer months,
17:22 and something is happening over
17:24 Christmas, apparently.
17:26 Okay.
17:31 So, okay, there's someone in
17:33 the waiting room, right.
17:35 Okay. So something is happening over
17:37 Christmas. Now let's go on.
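In code, that first plot is just the pipeline from two weeks ago plus two lines of ggplot (a minimal sketch, assuming the merged flights table from above):

```r
# average departure delay per month, then the simplest possible plot
delays <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = month]

ggplot(delays, aes(x = month, y = mean_delay)) +
  geom_point()
```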

slide 9 & 10: different types

17:40 We can of course also add different
17:42 geometries to a plot.
17:44 So far we just used the geometry
17:49 of a point; we can also use a line, or add
17:52 other geometries.
17:54 For the sake of simplicity, what I'm
17:56 doing here is using the tools that I
17:58 introduced two weeks ago
18:01 to do all of these things in one
18:03 line: we take the flights
18:06 data set, calculate for each month the
18:08 average delay,
18:11 and send everything with this pipe
18:14 operator here
18:16 to the ggplot. And in the
18:19 ggplot we just need to define the aesthetic
18:22 mapping:
18:23 that the x coordinate is the month and
18:25 the y coordinate
18:27 is the delay. And all of this
18:31 we just save in an object g on the left-hand
18:34 side.
18:36 And now we can take this g and add
18:38 different things to it.
18:39 We can add different geometries: at the
18:41 top left we have the
18:44 point as before, we can add a
18:47 geometry line,
18:48 then we get a line, we can add a bar
18:52 (that's called column),
18:55 or we can add all of them together to
18:58 the plot,
18:59 and then we have all of them together.
19:01 The information of
19:03 what happens with the data is not
19:06 contained in the geometry; that we have done
19:08 once in the beginning, and now we can
19:10 just operate on this object, add
19:12 different things, and change the plot the
19:14 way we like it.
19:18 There are also geometries
19:21 that
19:21 involve analysis. For example,
19:25 if you have a background in biology,
19:27 maybe you know your favorite
19:29 box plot, on the right-hand side, that
19:32 summarizes
19:33 different properties of the
19:36 statistical distribution.
19:38 For example, here on the left-hand side I
19:41 take the flights,
19:42 all this combined information, and
19:46 use the carrier as the x
19:48 coordinate
19:49 and the logarithm of the departure delay
19:52 as the y coordinate,
19:53 and I add this box plot, where I have
19:56 automatically
19:57 the median, and this is some
20:01 interquartile range, and then I
20:04 always forget what this means;
20:06 it's probably the range of the data
20:08 without
20:08 outliers or so. In
20:11 some disciplines these box plots are
20:13 used to characterize distributions.
20:15 Another way to characterize
20:17 distributions are
20:19 violin plots, which essentially give you
20:22 a plot of the probability distribution,
20:25 just in a vertical manner: the thicker
20:29 the violin is,
20:30 the higher the probability
20:32 to find a data point
20:34 there.
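A hedged sketch of this step (the native pipe |> stands in for whichever pipe operator the lecture used; the filter to positive delays before taking the logarithm is my addition):

```r
# build the base object once, then add geometries to it
g <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = month] |>
  ggplot(aes(x = month, y = mean_delay))

g + geom_point()                             # dots, as before
g + geom_line()                              # a line
g + geom_col()                               # bars ("column")
g + geom_point() + geom_line() + geom_col()  # all of them together

# geometries that involve analysis, on the raw (positive) delays
b <- ggplot(flights[dep_delay > 0],
            aes(x = carrier, y = log10(dep_delay)))
b + geom_boxplot()   # median, interquartile range, whiskers, outliers
b + geom_violin()    # the full shape of the distribution
```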

slide 11 & 12: aesthetic

20:37 So of course we can now also play
20:40 with
20:40 how these plots look. For
20:43 example, if you look at
20:46 the red part here,
20:47 I'm doing the same operation: I'm
20:50 calculating the average departure delay
20:54 for each month and each airport and each
20:57 carrier,
20:58 but now, just for simplicity, I only
21:01 take the big
21:02 three carriers: United Airlines, Delta and
21:04 American Airlines.
21:07 And now I create this plot again. I
21:09 have this aesthetic mapping:
21:12 the month should be the x coordinate, the
21:14 delay the y coordinate,
21:16 and now I have another aesthetic, which
21:19 is the color.
21:20 I say the color should be the origin
21:23 airport,
21:25 and the line type should
21:27 correspond to the carrier.
21:30 Now I just add the geometry of the line
21:32 and I get this plot that you see at the
21:34 bottom here.
21:36 You can see that all carriers have a
21:38 problem in the summer months
21:40 and also over Christmas,
21:43 except... so something is going
21:46 on with American
21:47 Airlines around March; no idea what
21:50 this is.
21:52 And, uh...
21:55 no wait, that's not American Airlines in
21:56 general, that's Newark:
21:58 American Airlines from Newark has a problem
22:01 in March.
22:02 You see, the
22:04 plot is not perfect yet.
22:06 Okay, so we can go on. We can
22:09 change
22:10 other aspects of the plot:
22:14 for example, here I say that the fill
22:18 should be the airport, and then I use a
22:21 box plot,
22:22 and then I get an overview of how the
22:25 different
22:26 airports compare
22:27 to each other
22:30 for each carrier. And what you can see
22:34 is that JFK is doing well for
22:37 some of them, but not for all:
22:40 for American Airlines and for United,
22:43 and,
22:44 where is it, for Delta it is
22:46 doing well,
22:48 but there is no clear trend here, of
22:50 course.

slide 13: subplots

22:52 Something that's more interesting is if
22:54 you plot these delays
22:57 for the big three carriers
23:00 as a function of the hour of the
23:04 day.
23:06 So here the x coordinate
23:09 is the hour, and I turn that into a
23:12 factor,
23:13 from numeric to something that is
23:16 discrete, just for plotting purposes.
23:19 The fill color is the origin airport,
23:24 and I added here a
23:27 subplot; that's called a facet,
23:30 by carrier. If you remember, from two
23:34 lectures ago, this is a formula; we can
23:36 use a formula in R
23:37 to specify how plots are distributed
23:40 across different subplots.
23:42 And you can see that for Delta and
23:44 United Airlines
23:46 you can nicely see how these delays
23:49 add up during the day.
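A minimal sketch of the faceted version (the log scale on y is how I read the slide described below; the filter to positive delays is my addition):

```r
ggplot(flights[carrier %in% c("UA", "DL", "AA") & dep_delay > 0],
       aes(x = factor(hour), y = dep_delay, fill = origin)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~ carrier)   # one subplot per carrier
```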
23:52 Yeah, and it looks a little bit like...
23:56 let's have a look at the next slide.

slide 14

23:58 Maybe let's
23:59 have a look at the next slide. We
24:01 can also have more complicated subplots:
24:04 for example, we can have
24:06 a grid
24:07 by using a more complicated formula,
24:10 where the
24:11 y-direction should be
24:13 the origin
24:14 and the x-direction in this grid should
24:17 be the carrier.
24:18 Then we get these plots, and you
24:21 can actually see
24:23 how these delays
24:25 add up during the day. And it seems,
24:28 it's a speculation,
24:30 but because we have a logarithm on the
24:32 y-axis
24:34 and we have a linear increase
24:37 in these delays
24:41 over time during the day, that
24:44 you have an exponential
24:45 build-up of delays. That's quite
24:47 interesting.
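The grid variant differs from the previous sketch only in the facet rule:

```r
ggplot(flights[carrier %in% c("UA", "DL", "AA") & dep_delay > 0],
       aes(x = factor(hour), y = dep_delay)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_grid(origin ~ carrier)   # origin in the y-direction, carrier in x
```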

slide 15: plot of 3 variables

24:49 Okay, so we can do all kinds of
24:53 other fancy things if we have
24:55 more than two variables.
24:57 For example, when we calculate the
24:59 average delay
25:02 as a function of the month, the hour and
25:05 the origin
25:05 airport, then we have
25:09 even more
25:11 variables that we want to visualize.
25:14 We can do that, for example, with
25:16 something that's called a heat map.
25:19 In this heat map here, the
25:24 fill, the color of these
25:27 tiles,
25:28 is given by the mean delay,
25:31 while the month and the hour are plotted
25:34 on the axes here.
25:36 Then we add the geometry of the tile to
25:39 get these heat maps.
25:41 Then you can visualize relationships
25:44 between two variables, namely month
25:47 and hour, and it seems like this build-up
25:51 of delays is specifically drastic in the
25:54 summer months,
25:56 while it's not that evident in other
25:58 months.
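A sketch of the heat map, assuming the same summary-then-plot pattern as before:

```r
d <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)),
             by = .(month, hour, origin)]

ggplot(d, aes(x = month, y = hour, fill = mean_delay)) +
  geom_tile() +          # one colored tile per month/hour combination
  facet_wrap(~ origin)   # one heat map per origin airport
```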

question on assigning facet

26:02 "Excuse me, I have a question on the syntax.
26:06 Here, where you've written facet_wrap,
26:10 the first argument tells us
26:13 the x argument, and so origin will be
26:16 plotted on the x scale?"
26:18 Yes. So this facet_wrap
26:21 is just to say: okay, take
26:25 one column of the data table, in
26:28 this case origin,
26:31 and group the data according to this
26:33 column, origin,
26:35 and then make one plot for each of these
26:39 origin
26:40 airports and put them next to each other,
26:44 as many as fit on the screen.
26:47 And if they don't fit on the screen,
26:48 go to the next line.
26:50 It's basically just this wrap. What I'm
26:53 saying here is that you have a
26:54 one-dimensional, so to say,
26:56 line of plots, compared to this grid,
27:00 this grid here, where was that... yeah, this
27:03 grid
27:04 is basically the
27:06 same thing,
27:07 but here we have these two
27:10 directions: origin airport in the
27:14 y-direction
27:17 and carrier in the x-direction.
27:20 This is the formula notation
27:22 in R.
27:23 It's a little bit counter-intuitive,
27:25 but you give it a formula
27:28 in order to tell
27:32 this package how these
27:34 plots should be distributed
27:36 on your screen, or in the
27:38 PDF file that you export.
27:41 And if you want, you can use...
27:46 the reason why I used the formula here
27:48 is that you could do something
27:51 like: you have here carrier plus,
27:54 for example, what are we plotting here,
28:00 carrier plus
28:04 month or so. If
28:08 you do something like this,
28:12 you can have a more complicated formula
28:15 to say that a combination of carrier and
28:19 month, of these two columns,
28:21 should be in the x-direction here,
28:27 and in the y-direction you have origin.
28:30 So you can
28:30 construct more complicated
28:34 grids of plots if you wanted to;
28:37 very often that's not very useful.
28:41 Just
28:43 think about what's on the...
28:46 "That was my question, because
28:48 on the next slide the
28:50 origins have been
28:52 plotted on the x
28:53 scale, whereas in this particular case
28:55 the origin airport has been plotted
28:56 along the y scale."
28:59 Exactly. So now I'm trying to
29:02 get my mouse cursor back, it's here,
29:06 so, okay, now I can change the slide,
29:08 hopefully.
29:10 Okay, here we go. So here I just left out
29:14 the first argument;
29:17 I left out the left one, so that was
29:20 originally the y-direction,
29:21 and now I only have the x-direction
29:24 left, from left to
29:26 right.
29:28 And I use here wrap and not grid because
29:30 I want
29:31 these plots, if I have not
29:34 three but like 15 different groups,
29:38 I don't want them to be all in the same
29:40 line, because I wouldn't be able to see
29:41 them on the screen.
29:43 Wrap means: once the screen is full, go to
29:46 the next line.
29:47 Nothing else; it's not a
29:50 complicated thing here. It just
29:52 makes one plot for each origin,
29:55 one plot for each origin.
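To make the wrap-versus-grid distinction concrete, a hedged sketch (assuming a summary table d with columns hour, mean_delay, origin, carrier, and month):

```r
p <- ggplot(d, aes(x = factor(hour), y = mean_delay)) + geom_col()

p + facet_wrap(~ origin)                  # one plot per origin, wrapping to new rows
p + facet_grid(origin ~ carrier)          # grid: origin in y, carrier in x
p + facet_grid(origin ~ carrier + month)  # two columns combined in the x-direction
```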
29:57 "Excuse me, yes? On this heat map,
30:00 the color bar: the minimum value is not
30:03 zero.
30:05 Does it mean that there are
30:08 flights that departed earlier than
30:10 scheduled?"
30:13 Yes, exactly, they
30:16 departed earlier.
30:19 "That's not very funny for the passengers."
30:22 Yes, but sometimes that happens,
30:25 and it's also a question of how the
30:27 data is recorded;
30:29 it depends on how the data is recorded.
30:33 This
30:37 specifically affects the early mornings
30:40 and the very late times,
30:44 these negative
30:45 departure delays.
30:47 I mean, sometimes that can happen.
30:51 So sometimes,
30:52 the question is when... it's not the time
30:56 when the gates close,
30:58 it's probably the time when the
31:00 airplane starts, or something like this.
31:04 And as you know, if the
31:06 boarding is...
31:07 very often it happens that once
31:09 boarding is completed,
31:11 the airplane sometimes
31:14 leaves a little bit earlier
31:15 than the schedule.
31:18 "Okay, thanks." But as always, there's a good
31:22 point here:
31:22 it's always good to know how the
31:24 data was actually collected.
31:26 You think that a delay is well
31:28 defined, but then you can
31:29 measure this in different ways;
31:32 that's always a very important
31:34 aspect. And in this data set there are
31:37 also
31:37 missing numbers, a lot of missing
31:41 numbers,
31:41 and that's when an airplane started
31:44 somewhere but didn't end up at its
31:47 arrival location, but at another
31:49 airport. So that's
31:51 also possible.

slide 16: error bar

31:54 Okay, so let's go on.
31:58 We can also use ggplot
32:01 to do statistical computations
32:03 on the fly, and this is particularly
32:06 useful for computing
32:08 fancy error bars; that's
32:11 how I use it. So what
32:15 we can do, for example here at the top, is
32:18 that
32:19 we have the hour on the x-axis, the
32:22 departure delay
32:24 on the y-axis, and then
32:27 color and fill as the origin
32:30 airport. And then for each of these
32:33 combinations, because we take here the
32:35 raw data,
32:36 we have many different values, we
32:38 have many different flights,
32:40 and we can then take just a
32:43 statistical summary function from
32:46 ggplot,
32:48 tell this function to calculate the mean,
32:53 and use the geometry of a line
32:56 to do the computation for us. For
32:59 simple things like the mean
33:00 calculation
33:01 that's pretty good, but the nice thing is
33:03 that we can also use
33:05 summary functions that are more
33:08 complicated. For
33:10 example, here we have
33:11 bootstrapped
33:13 confidence intervals; that's
33:16 basically a fancy way of calculating
33:18 confidence intervals,
33:20 and we use the geometry of a ribbon
33:23 now to visualize these confidence
33:25 intervals.
33:27 And if you deal with confidence
33:28 intervals, you know they're quite
33:30 complicated to calculate, and then you
33:32 somehow have to bring them into your
33:33 plot.
33:34 So here you don't have to worry about
33:36 this: you have the most fancy methods
33:37 in just
33:38 one line, and you get a nice
33:41 visualization of the uncertainty of your
33:43 data.
33:45 Another thing we can do: here in
33:47 this upper case our
33:49 x-axis is discrete, because it's an hour,
33:52 from 0 to 24,
33:55 or 23, or 5 to 23.
33:58 But sometimes we have real-numbered values,
34:01 and then we need to tell
34:03 these functions which values to put
34:06 together; that means we need to
34:07 bin the data. An example here is the
34:10 temperature,
34:11 in Fahrenheit, on the
34:15 x-axis. We cannot calculate a
34:19 separate mean
34:21 for every value of the temperature,
34:23 because temperature is
34:25 a
34:27 real-valued quantity,
34:30 and we need to define a bin to
34:33 summarize values of the temperature.
34:36 And that we can do automatically with
34:38 these stat_summary_bin
34:40 functions, where we
34:43 just tell the function to bin the
34:45 data
34:47 and for each of these bins again
34:48 calculate the mean,
34:50 and plot the result with the geometry of
34:53 a line.
34:54 And we can do the same fancy error bars
34:57 and confidence interval calculations
34:59 as before. And you probably know
35:03 that
35:03 these kinds of procedures are quite
35:05 complicated if you have to do them
35:07 yourself.
35:10 And here probably the
35:12 message is that it's
35:14 not good if it's too cold or too hot.
35:18 But then there are correlations, right? So
35:21 when it's hot, here on the right-hand
35:25 side,
35:25 that's also when the holidays take
35:27 place: that's July
35:29 and June, as in these previous heat
35:32 maps
35:33 where we had these delays. So it's
35:35 not
35:36 quite clear whether it's
35:38 the temperature that's bad for the
35:40 engines or something like this,
35:42 or whether it's just the number of
35:44 people that
35:45 go on holiday and block the airport
35:48 and lead to delays there.
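A hedged sketch of both variants; mean_cl_boot() is ggplot2's bootstrap-confidence-interval helper and needs the Hmisc package installed, and the temp column is assumed to come from the weather merge:

```r
g <- ggplot(flights, aes(x = hour, y = dep_delay,
                         color = origin, fill = origin))

g + stat_summary(fun = mean, geom = "line")   # mean per hour, computed on the fly
g + stat_summary(fun.data = mean_cl_boot,     # bootstrapped confidence interval,
                 geom = "ribbon", alpha = 0.3)  # drawn as a ribbon

# for a real-valued x like temperature: bin first, then summarize per bin
ggplot(flights, aes(x = temp, y = dep_delay)) +
  stat_summary_bin(fun = mean, geom = "line", bins = 30) +
  stat_summary_bin(fun.data = mean_cl_boot, geom = "ribbon",
                   bins = 30, alpha = 0.3)
```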

slide 17: interpolation

35:53 Now we can do even more fancy things: we
35:55 can do
35:56 interpolation. On the
35:59 left-hand side, I just
36:00 have the same plot as before;
36:04 we have the month...
36:08 wait, okay, here's a little error: the hour
36:11 on the x-axis,
36:13 we have the hour on the x-axis.
36:17 And now we can add an
36:19 interpolation line,
36:20 just in one line, with this summary
36:23 function,
36:24 and if we wanted to we would be able to
36:26 do
36:27 non-linear interpolation, linear
36:29 interpolation, or anything we want, just
36:31 with an argument,
36:33 and we get our usual nice error bars for
36:36 free.
36:38 Of course, if we can do non-linear fits
36:40 we can also do linear fits, so we can fit
36:42 linear models to check for correlations,
36:45 and that's what we do on the right-hand
36:47 side. And on this right-hand side,
36:50 what is actually quite interesting here
36:53 is that on the x-axis we have the month,
36:56 on the y-axis we have the number of
36:59 seats
37:00 in an airplane, and the color
37:04 is the origin airport.
37:07 Now, what we find here is that there's an
37:09 almost perfect linear relationship
37:12 between the month and the number of
37:14 seats.
37:15 There's a perfect linear
37:16 relationship, which is positive
37:20 for Newark and
37:23 JFK, and
37:26 negative for LGA, which is, I think,
37:29 LaGuardia.
37:30 No idea where this comes from,
37:33 but apparently you are,
37:37 in December, sitting in a smaller plane;
37:40 you have a higher likelihood to sit in a smaller
37:42 plane in December
37:44 if you are departing from LaGuardia,
37:47 while the planes get larger,
37:50 linearly larger,
37:52 throughout the year,
37:54 for some reason. So that's
37:56 one of these
37:57 things where you should be
37:58 suspicious and check what is actually
38:01 the underlying
38:02 reason for this data. That's
38:05 something I will show you
38:06 later, too:
38:07 it's very important to check,
38:10 when you do statistical computations,
38:12 whether they actually make sense or not.
38:14 Big data gives you every result
38:18 that you want if you just look for it:
38:21 just because you have so many
38:23 dimensions, so many samples, you can
38:25 find every hypothesis you want in these
38:28 data sets
38:29 if you just keep looking for it.
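A sketch of both fits using geom_smooth, ggplot2's one-line smoothing layer (the seats column is assumed to come from the planes merge):

```r
# non-linear interpolation line with a confidence band, in one added layer
ggplot(d, aes(x = hour, y = mean_delay)) +
  geom_point() +
  geom_smooth()              # loess/gam smoother by default

# linear fits to check for correlations, one per origin airport
ggplot(flights, aes(x = month, y = seats, color = origin)) +
  geom_smooth(method = "lm") # linear model, error band for free
```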

slide 18: scales

38:33 So, we can play around with scales.
38:36 That means that we can change
38:38 how our plots look. For example,
38:42 in this plot here at the top, that's
38:44 something we have
38:45 seen before, that's the box plot; we save
38:48 this
38:48 in a variable p. And now you see
38:51 this weird arrow assignment operator
38:55 in
38:55 R; why the R community
39:00 likes that you can
39:02 assign in either direction...
39:03 So I have the plot, and I assign
39:06 the result to a variable p;
39:09 it's an
39:12 asymmetric
39:12 assignment. So now we have our plot,
39:16 and we can add different
39:18 color scales, and we can add different
39:20 ways of how our data values
39:23 map to visual characteristics of the
39:26 plot.
39:28 For example, we can add a new color scale,
39:31 scale_color_brewer; then we get
39:33 different
39:34 blue tones. We can also add
39:38 a manual mapping, where we say that
39:42 we
39:43 want to have black, gray and white as the
39:45 colors for our airports.
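A sketch; since the box plot maps origin to fill, the fill variants of the scale functions are used here (the Blues palette is my guess at what produced the blue tones):

```r
p <- ggplot(d, aes(x = carrier, y = mean_delay, fill = origin)) +
  geom_boxplot()

p + scale_fill_brewer(palette = "Blues")                       # ColorBrewer blues
p + scale_fill_manual(values = c("black", "grey50", "white"))  # manual mapping
```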

slide 19: positions

39:48 So we can change visual
39:50 characteristics, and we can also change,
39:52 of course, how things are positioned
39:56 relative to each other. I'll go
40:00 quickly over this because it's a
40:01 little bit of a detail.
40:03 For example, we can
40:05 create this plot here on the right-hand
40:07 side,
40:08 where we, for each month,
40:12 origin and carrier, calculate the
40:16 average delay
40:17 for the three carriers, and then we can
40:20 make a plot
40:23 where we assign the month to the x-axis,
40:27 the delay to the y-axis, the fill
40:30 color to the origin airport,
40:34 and the transparency of this color
40:37 to the carrier. And then
40:41 we can plot all of this using a bar
40:44 plot,
40:45 and if we have a bar plot we can decide
40:47 how to
40:48 put these bars relative to each other.
40:51 I'll just
40:52 give you three examples: we can stack
40:55 them on top of each other,
40:56 that's on the right-hand side;
40:59 we can dodge them,
41:02 which means that we put them next to each
41:05 other, that's in the middle;
41:07 and we can use a fill
41:10 position, which means we always fill them
41:13 up to one;
41:14 that means we look at the fraction that
41:16 a certain carrier
41:17 and origin airport contribute
41:21 to the total delays. And,
41:24 let me just see...
41:28 yeah, you can see here, for example,
41:32 that here a large
41:35 fraction of the delays in March
41:37 actually comes from this Newark airport,
41:41 while in other months, for example in
41:44 the summer months,
41:45 the larger fraction of the delays actually
41:47 comes from the other airports,
41:49 JFK and LGA.
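A sketch of the three position adjustments (mapping alpha to a discrete carrier draws a warning in ggplot2 but works, as described here):

```r
d <- flights[carrier %in% c("UA", "DL", "AA"),
             .(mean_delay = mean(dep_delay, na.rm = TRUE)),
             by = .(month, origin, carrier)]

b <- ggplot(d, aes(x = month, y = mean_delay,
                   fill = origin, alpha = carrier))

b + geom_col(position = "stack")  # bars stacked on top of each other
b + geom_col(position = "dodge")  # bars next to each other
b + geom_col(position = "fill")   # bars normalized to 1: fractions of the total
```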
41:53 Yeah, and we can also do something else:

slide 20: coordinate system

41:56 we can also change
42:00 the coordinate system.
42:02 So far we always assumed that we
42:04 have
42:04 Cartesian coordinates, but we can of
42:06 course have any other coordinate system.
42:09 Here, for example, we are plotting
42:12 as the x coordinate the wind direction,
42:16 as the y coordinate the departure delay,
42:20 and then we just calculate the average
42:22 delay
42:23 again, using the summary function,
42:27 for certain intervals of the
42:30 wind direction.
42:33 And we can then plot that, for
42:35 example, in different ways:
42:38 we can plot that, of course, in
42:41 Cartesian coordinates;
42:43 something that's more instructive,
42:45 when we talk about
42:46 directions, is to use polar coordinates.
42:50 And you can see that I do that by
42:53 adding just one line,
42:54 one more rule, to the plot.
42:58 And now I can add more
43:01 aesthetic mappings: for example, I can
43:03 separate,
43:04 as before, these different contributions
43:07 from the wind direction
43:09 by airport, and this is what I've done
43:11 here: I just added one more aesthetic
43:13 mapping here,
43:15 I said that these bars should be next to
43:18 each other and not on top of each other,
43:21 and I have the polar coordinates as
43:23 before.
43:24 Then you get the plot that you have
43:26 on the right-hand side.
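A sketch of the polar version, assuming the wind_dir column from the weather merge and the binned summary from before:

```r
ggplot(flights, aes(x = wind_dir, y = dep_delay, fill = origin)) +
  stat_summary_bin(fun = mean, geom = "col",   # mean delay per wind-direction bin
                   bins = 16, position = "dodge") +
  coord_polar()   # one added rule switches the coordinate system
```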
43:28 And in this plot, what you see
43:31 is that there is a relation between the
43:33 wind direction
43:35 and the departure delays, specifically
43:38 when the wind
43:39 comes from, what is that, southwest,
43:43 a little south-southwest, for the two
43:46 airports
43:47 LGA
43:50 and Newark. Whatever the
43:53 reason for that is...
43:54 it's actually there for all of them, yes,
43:56 it's actually there for all of them,
44:01 but it's specifically strong for LGA and
44:04 Newark.
44:05 And if you look at the location of New
44:07 York, that's where the sea is;
44:09 that's probably also where a lot of the
44:11 wind,
44:13 the strong winds, come from, from this
44:15 direction.
44:16 Okay. This was just playing
44:19 around with the data,
44:20 and you get some insights
44:23 from just visualizing the data,
44:27 and these insights are of course much
44:29 harder to get if you just look at data
44:31 tables on the console, as we did two
44:34 weeks ago.
44:35 And what you can also see here, if you
44:38 create such plots:
44:40 you can make more and more
44:42 complicated plots,
44:44 but the complexity of your code
44:47 barely changes; it only
44:50 increases linearly, because you're
44:51 adding just one bit at a time
44:53 to your plot, one layer by layer, and you can make
44:57 plots as complicated as you want
45:00 from this, without adding
45:04 more and more complexity to
45:06 your code, and
45:07 without requiring more and
45:11 more specialized
45:12 functions. And that is the advantage of
45:17 having such a
45:19 grammar of graphics: it allows you
45:21 to have simple rules,
45:23 visual rules, that allow you to
45:26 add more and more components to a plot.
45:29 And then of course we can make these
45:31 plots look

slide 21

45:32 nice. We can add things like
45:35 axis labels for all of our columns.
45:39 Typically you get a data table
45:42 where some experimentalist has used
45:45 their own notation for things,
45:47 which doesn't make much sense most of the time;
45:49 you want to have your own
45:54 names for the axes and for the
45:57 colors and for the
45:58 legends, specifically
46:01 including units, if you want to publish
46:03 that. And you can do that easily
46:06 with this labs command, and there's also
46:09 a title command, if you want;
46:11 here you can add a title and
46:14 annotate your plot as much as you want.
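For example, continuing with the plot object p from above:

```r
p + labs(x     = "Carrier",
         y     = "Mean departure delay (min)",
         fill  = "Origin airport",
         title = "Departure delays at the three NYC airports")
```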

slide 22: extensions

46:17 And then you can get as complicated as
46:21 you want: you can
46:22 download extensions. For example, some
46:25 nice extensions add
46:26 new geometries and new coordinate
46:29 systems to these plots.
46:31 Here, these plots that are used in
46:33 anatomy:
46:34 they add the human body and mouse body
46:37 as coordinate systems,
46:38 and you can then easily, without
46:40 more complexity
46:42 than what I already showed you,
46:45 visualize your data,
46:49 for example imaging data, on the
46:52 mouse or
46:53 human body, or whatever you want to do.
46:55 There are
46:56 a ton of different extensions like this.
47:00 Okay, so this is a very efficient way of
47:02 visualizing data

slide 23: there is also a python implementation

47:03 that relies on a grammar, a set of
47:06 rules.
47:07 I showed you an R implementation, but
47:09 there's also a Python implementation,
47:12 and the Python implementation is rather
47:14 new, so
47:16 I don't know what quality it is.
47:20 And what we now want to do
47:24 is, I want to
47:28 show you how to use these
47:31 tools that
47:33 we've seen in the last couple of
47:34 lectures in a specific

The following slides are nowhere to be found in the slides shared on the course website. The lecturer went through a Jupyter notebook about an RNA sequencing project.

47:37 data science project. And for this we'll
47:40 just
47:41 go through the code of a real data
47:43 science project, and this is a project
47:45 that Fabian actually did
47:47 while he was in the group. The
47:50 starting point of this project
47:54 is a so-called sequencing experiment.
47:57 I've already shown you this table:
48:02 the rows and columns in such an
48:05 experiment,
48:06 that would be, so
48:08 to say, the matrix that
48:10 experimentalists would send you. Here,
48:13 every row
48:14 is a different gene, and every column is
48:20 a different cell.
48:21 We have maybe twenty
48:23 thousand cells, thirty-seven thousand
48:25 cells,
48:26 and for each of these cells we have
48:29 roughly
48:30 ten thousand measurements, and
48:33 these measurements correspond to how
48:36 strongly
48:38 a certain gene, the one
48:42 in the row here, is expressed in this
48:45 particular cell. So these numbers here
48:47 correspond to how many products of these
48:50 genes these experimental techniques
48:53 found in a given cell. And these genes,
48:56 as you
48:57 might have heard, tell us a lot about
49:00 what cells are doing and how they're
49:02 behaving and what kind of cells they are,
49:04 so they're very important
49:06 molecular measurements of what's going
49:09 on inside cells. So, for example,
49:13 this gene here, that has this cryptic ID,
49:16 in row four,
49:20 is not expressed; it has a little bit of
49:23 signal in this particular cell, but
49:26 not in other cells,
49:27 while other genes, like this one here, have
49:30 very
49:31 high expression values: they have very
49:34 high counts of products from these genes
49:38 that were detected in these experiments.
49:42 What I have to tell you is that these
49:44 experiments are extremely messy.
49:47 Especially, there is
49:49 a step where
49:50 the data is exponentially amplified,
49:54 and that exponentially amplifies errors
49:57 in these data sets. So it's a
50:00 big mess, and now we have to
50:02 find some structure
50:05 in this high-dimensional
50:08 data set, in these
50:10 genomics experiments. And to show you how
50:13 this is working, I'll share another file,
50:19 I'll share another screen.
50:24 Where are we...
50:28 here,
50:31 here we go. Now I'll just give you a
50:34 hands-on look
50:37 at how this actually works. I won't
50:40 tell you too much about the biological
50:43 background of this project, because it's
50:45 not yet published.
50:47 So, can you see... you should
50:50 be able to see
50:51 the browser, right?
50:54 And you can see that this here is
50:57 actually a combination
51:00 of Python, the first block,
51:03 and R. So here
51:07 he's loading some R packages,
51:10 and here he's loading Python packages,
51:13 and all of this is a Jupyter notebook,
51:15 combining R and Python to
51:18 take the best of both worlds.
51:23 And then there's a lot of
51:25 data loading going on.
51:27 I'll just go through; of course
51:29 we don't have to look in detail at how the
51:31 data is loaded.
51:32 There's some biological background
51:35 information about what
51:36 different genes are doing, and
51:40 so on. And now
51:45 we start with the pre-processing of the
51:48 data.
51:49 As I told you, this data is messy;
51:53 this data is like
51:56 80 percent nonsensical
52:00 information. In other
52:03 words, it is dominated by
52:05 technical noise; the technical noise is
52:08 extremely strong,
52:10 and it gives rise to very weird results.
52:13 So the first step we always have to do,
52:16 in this
52:17 particular example in genomics but also
52:19 in other data sets,
52:20 is we have to look at the data and
52:23 polish it in a way so that we are
52:27 actually, in principle, able
52:29 to detect information here.
52:34 So, for example, this plot here at the top
52:38 shows you basically what is the
52:40 percentage
52:42 of all information in a cell
52:46 that goes to certain genes. They
52:48 have these weird names; they're
52:49 completely
52:50 irrelevant. But you see that this gene at
52:53 the top here,
52:54 that has an even weirder name,
52:57 in some cells comprises eighty percent of
53:00 the
53:02 information, and that
53:05 does not make any biological sense,
53:08 because if you have
53:09 thirty thousand genes in a cell, it can't
53:11 be that the cell
53:13 is completely packed with
53:15 products from a single gene; that
53:17 cannot happen in real life. And that's
53:20 why we see that here there are a lot of
53:22 cells,
53:23 everything where we have more
53:25 than maybe 30 percent here,
53:27 where we don't have any reasonable
53:31 information. That means that we need
53:34 to do quality control: we
53:36 need to keep cells that actually
53:40 have meaningful information, and we have
53:42 to
53:42 throw away cells that don't have
53:44 meaningful information.
53:46 And what we do is we look at such
53:50 histograms here.
53:52 We take a look at these histograms
53:54 and we calculate probability densities
53:58 over all cells, so all columns of this
54:01 matrix,
54:02 and on the x-axis here is the
54:06 total amount of information that we have
54:09 for a cell,
54:09 so the total number of
54:11 molecules that we detected
54:13 for a single cell. And you can see this
54:15 follows a distribution,
54:18 and the first thing that you see is there
54:20 are two peaks: some cells are worse
54:22 and some cells are better, so there is
54:25 already some variance in the data
54:27 just because the quality of our
54:30 measurement
54:31 differs between two groups
54:33 of cells.
54:35 But all of these are actually good
54:37 values; they're all good.
54:39 And we just take out some cells
54:42 here, that is the
54:43 vertical line, that are below one million
54:46 of these counts; these we throw away.
54:51 We can also
54:53 look at other counts, so,
54:59 this is for example how many genes
55:01 we detect,
55:02 and here we also cut off,
55:06 here we also cut off,
55:09 basically, cells that are low quality,
55:12 in these tails here; we just remove them
55:14 from the data set,
55:16 because we know that if we kept them in
55:18 the data set, in the long term
55:19 we would have problems with
55:21 dimensionality reduction; these
55:23 things would dominate, in the end,
55:26 things that are based on machine
55:28 learning: clustering,
55:29 dimensionality reduction. So we remove
55:31 them from the data set.
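A minimal sketch of this quality-control filtering in base R; `counts` is assumed to be a genes-by-cells matrix, the thresholds are illustrative rather than the notebook's values, and "mt-" as a zebrafish mitochondrial gene prefix is an assumption:

```r
total_counts <- colSums(counts)       # total molecules detected per cell
n_genes      <- colSums(counts > 0)   # number of genes detected per cell

# fraction of counts going to mitochondrial genes (prefix is an assumption)
mito_genes <- grep("^mt-", rownames(counts))
mito_frac  <- colSums(counts[mito_genes, ]) / total_counts

# cut-offs set by eye, as discussed below
keep   <- total_counts > 1e6 & n_genes > 1000 & mito_frac < 0.1
counts <- counts[, keep]              # drop the low-quality cells
```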
55:34 "Sorry, excuse me,
55:37 a question:
55:41 is there a systematic
55:43 way to set
55:46 the threshold to filter out?"
55:49 In this case it's a matter of
55:50 experience;
55:53 there's not a systematic way. Normally
55:55 you would set the threshold, in such a
55:58 data set,
56:00 here in the
56:02 middle between the two
56:04 peaks, but because the two
56:06 peaks are both at reasonable values
56:09 and they're both of the same height,
56:10 you know, as I said, we would
56:12 lose 50 percent
56:13 of the data, and that's a little bit too
56:15 much. They're all reasonable, but
56:17 we have to check later,
56:19 if we find two groups of cells in
56:22 the data,
56:23 that these two groups are not
56:26 just representing these two peaks
56:30 here in the quality of the measurement.
56:32 So we now go
56:35 on with the analysis,
56:37 and if something is suspicious
56:39 we go back to this stage here,
56:42 and we might have to be more rigorous
56:45 with this
56:45 cut-off here. So in
56:49 this particular case there's no rigorous
56:51 way of doing that;
56:52 it's a matter of how much you expect,
56:54 what is a good measurement.
56:56 And basically this is a
57:00 pretty good example in terms
57:03 of these counts here,
57:04 if you're working in genomics...
57:06 this is actually zebrafish,
57:08 so we have fewer than in other animals
57:11 in total. But here we don't
57:15 have, basically...
57:17 sometimes you have another peak here
57:20 at very low values,
57:22 and that we would then completely cut
57:24 off.
57:26 So here the problem is more this one
57:28 here, this plot:
57:30 we have a lot of cells that have a high
57:33 percentage
57:34 of mitochondrial counts. Those
57:37 are genes
57:38 on the DNA that is in the mitochondria,
57:42 and this DNA does not produce many gene
57:48 products,
57:49 so it's suspicious if you have too
57:51 much of that in
57:53 these cells. And here we take out
57:56 roughly, I guess, 20 or 30...
57:59 20 percent of the cells
58:02 we lose in this step.
58:05 And we can also plot both with
58:08 respect to each other:
58:09 for example, we can here on the x-axis
58:11 have these different values that
58:13 represent the quality of our data,
58:15 and then just put these
58:19 vertical lines that we had in the
58:20 histograms
58:22 into this scatter plot here, and
58:25 see what kinds of
58:27 cells we lose here,
58:29 visually.
58:33 Okay, so now we got rid of the
58:36 bad stuff, the things that are
58:38 totally crazy. The next thing
58:41 we need to do is to make
58:44 cells comparable. Cells
58:48 still have different
58:51 measurement qualities;
58:54 they're still
58:55 different for technical reasons. For
58:57 some cells we have a lot of
58:58 information,
59:00 a lot of these counts that
59:02 we detected,
59:03 and in other cells we have less. But we
59:06 want to make them comparable to each
59:08 other,
59:08 and that's why we have to normalize the
59:11 data.
59:12 There are fancy ways, in
59:14 genomics, of doing
59:16 this, and you can see we're
59:17 doing all of that basically.
59:21 So we have to normalize the data, we
59:24 have to make the cells comparable; that's what
59:26 you always have to do.
59:27 And what we also do here: these
59:30 counts, as you could see
59:34 in the matrix that I showed you,
59:36 there were very large numbers and very
59:38 small numbers.
59:40 These
59:43 counts live on an exponential scale;
59:46 their distributions are
59:48 very skewed: it's a few cells that have a
59:51 huge amount, or a few genes
59:52 that have a huge amount, of these counts,
59:55 of these
59:56 measurements. And that's not
59:58 something that works very well with
60:00 dimensionality reduction
60:02 or clustering methods, so we take the
60:04 logarithm;
60:05 we log-transform the data for
60:08 any further processing. That's also
60:11 something that you do if your
60:13 data is too spread out, or if it comes
60:15 from some
60:16 exponential process: you
60:19 log-transform because you want it to
60:21 be something
60:22 like normally distributed, something
60:24 symmetric and
60:25 rather compact as a distribution.
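A simple stand-in for the fancier normalizations used in the notebook: scale each cell to a common total, then log-transform.

```r
# make cells comparable, then compress the skewed, exponential-scale counts
total_counts <- colSums(counts)
size_factors <- total_counts / median(total_counts)

norm_counts <- sweep(counts, 2, size_factors, "/")  # divide each cell (column)
log_counts  <- log1p(norm_counts)                   # log(1 + x) handles the zeros
```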
60:30 Okay, let's go on. So here we do
60:33 a variance-stabilizing
60:35 transformation, and we do more stuff
60:37 on the data,
60:40 and then, now we can start
60:44 to understand the data. The first thing
60:46 we need to do
60:48 is to see: does it actually make sense,
60:51 what we do,
60:52 what we have here? Are we actually
60:54 looking at biological information, at
60:56 real information,
60:57 or are we just looking at technical
61:00 aspects
61:01 of the experiments? As a first step,
61:05 what we do here is we
61:08 plot a little PCA; Fabian showed
61:11 last week what a PCA is,
61:13 and these data, on the
61:15 principal component analysis plots,
61:18 look like this. And you can see,
61:21 here I plot for example the total
61:24 amount of these counts in a cell; you
61:27 know, that's
61:28 technical, that's just measuring
61:31 the quality
61:33 of the measurements. And you can see
61:35 there is some variability here: these
61:37 cells have a little bit more, these
61:40 cells a little bit
61:41 less, and some of this technical
61:43 variability is captured
61:46 by this principal component analysis. So
61:53 here we're fine
61:54 with it; it's not extreme, we don't have
61:56 disconnected clusters,
61:57 so we are fine with this; it's already in
62:00 good shape.
62:01 And also, we know that cells can have
62:05 these differences
62:06 in the total number of molecules in the
62:09 cell for biological reasons.
62:13 Then we can look at these plots
62:15 here: what is the percentage of variance
62:18 actually explained by certain principal
62:21 components? (Actually, the
62:23 y-axis is
62:24 something else, but we can order these
62:27 principal components
62:29 based on how much of the total data they
62:31 explain.)
62:33 And what we do, and this is sort of
62:35 the professional
62:37 way of doing things, is we do all further
62:39 calculations not on the real data
62:43 but on the principal components of the
62:45 data. That's an
62:46 intermediate step that we do just to get
62:49 cleaner results in the end.
62:51 So what we do here is we take the
62:53 first
62:54 20, 25 or so principal components,
62:58 which constitute like 99 percent of the
63:01 variance of the total
63:02 data, and we say the rest is noise; that's
63:04 a way of getting rid of the noise
63:06 in the data. And now we go on
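A sketch of that intermediate step with base R's prcomp (the number of components kept is the lecture's rough figure, not a rule):

```r
# PCA on the log-transformed matrix; cells as rows, genes as columns
pca <- prcomp(t(log_counts), center = TRUE)

# order components by the fraction of variance they explain
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
plot(var_explained[1:50], xlab = "principal component",
     ylab = "fraction of variance explained")

n_pcs <- 25                  # keep the first ~25 components; the rest is "noise"
X     <- pca$x[, 1:n_pcs]    # all further steps run on these coordinates
```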
63:10 now we go on and do further dimensionality reduction.
63:16 let me make this larger;
63:24 i hope you can see these plots.
63:28 this is a UMAP,
63:32 a non-linear way of reducing the dimensions that
63:37 Fabian showed you last week.
63:40 and you can see that once we do the non-linear dimensionality
63:44 reduction, our data already looks much more structured. these cells
63:51 are actually from the brain, they are brain cells,
63:55 and of course there are different kinds of cells in the brain.
63:59 because there are different kinds of cells in the brain, we also
64:03 expect structure to pop up in these low-dimensional representations;
64:11 typically these clusters correspond to different kinds of cells.
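The corresponding scanpy calls, sketched with typical parameters; the neighborhood size of 15 is an assumption, not a value given in the lecture.

```python
# Non-linear dimensionality reduction: build a k-nearest-neighbor graph
# on the retained principal components, then embed it with UMAP.
import scanpy as sc

sc.pp.neighbors(adata, n_neighbors=15, n_pcs=25)  # graph on the 25 PCs
sc.tl.umap(adata)                                 # 2D embedding
sc.pl.umap(adata)                                 # the "gray bunch of dots"
```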
64:16 so far, though, this is just a gray bunch of dots in a two-dimensional plane:
64:21 we don't know what the axes mean,
64:23 and we don't know what these cells are. now we have to dig a little
64:26 deeper,
64:28 and we do clustering. here is the
64:32 clustering; it is one of
64:35 these community-based clustering algorithms,
64:39 and we ran it for several
64:43 resolutions. in clustering,
64:46 most of the time you have to tell the algorithm roughly
64:49 how many clusters you want,
64:53 and that is the resolution
64:56 parameter you give to these algorithms.
65:00 what you see here are different clusterings
65:06 with different resolutions. so here you say, okay,
65:10 give me, what is it, 15 clusters,
65:13 and you get the plot on the bottom left; if you say, okay,
65:16 give me eight clusters, you get
65:20 the clusters on the top left,
65:24 and you can have more clusters if you want.
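A sketch of clustering at several resolutions; Leiden is one common community-detection algorithm in this ecosystem (the lecture does not name the specific one used), and it requires the `leidenalg` package.

```python
# Community-based clustering on the neighbor graph, at several
# resolutions; higher resolution yields more, smaller clusters.
import scanpy as sc

for res in (0.3, 0.6, 1.0):
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_{res}")

# side-by-side UMAPs, one per resolution
sc.pl.umap(adata, color=["leiden_0.3", "leiden_0.6", "leiden_1.0"])
```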
65:28 they’re different stages but we don’t
65:30 know yet
65:31 what makes sense yeah we don’t know how
65:33 many real clusters there are in the data
65:37 but we can take all of them and we can
65:39 go one step further
65:41 how do we know that such a cluster is a
65:43 real biological cluster
65:46 we know that if the cells in a cluster
65:50 all share some property that is not
65:53 shared
65:54 by other that is not shown by other
65:57 cells
65:59 now then we know that this cluster is
66:01 something real
66:02 uh that really is going on in the brain
66:06 and the way we do that is,
66:09 let's scroll down, we look at the literature.
66:13 we look at papers,
66:16 and in these papers
66:20 we see, okay, there are different kinds of cells in the brain,
66:25 and people have done experiments,
66:28 genetic experiments for example, where they found
66:32 that stem cells express a certain gene,
66:36 for example this marker gene here that is expressed by stem cells.
66:40 now we plot the UMAP with the color
66:45 representing how much of the product of
66:48 this gene we found in each cell,
66:53 and now it starts to make a little sense:
66:54 here in this corner, on the top left,
66:58 these are our stem cells.
67:01 and then we can say, okay, there are
67:03 also neurons in the brain and many other things, so what is going on here?
67:07 there is another gene that marks a cell type
67:13 that is more advanced along the path
67:14 from stem cells; in the next step it is expressed
67:18 in these cells here. and like this you can go on
67:22 and identify all of these clusters
67:27 step by step, and work out
67:30 what kinds of cells you have in the data.
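The plotting call itself is one line; the gene name below is a deliberate placeholder, since the actual marker genes used on the fish-brain data are not legible in the recording.

```python
# Color the UMAP by the expression of a literature-derived marker gene.
# "stem_cell_marker" is a placeholder name, not a real gene symbol.
import scanpy as sc

sc.pl.umap(adata, color="stem_cell_marker")
```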
67:34 you can do that with even more genes;
67:38 there are a lot of genes, like these here, that identify neurons
67:43 and different kinds of neurons. here we have a small
67:46 feature of this plot that is picked out by one of these genes,
67:50 and if you talk to biologists, all of these gene names
67:54 are associated with different shapes or different functions of cells.
67:59 we can also do fancier things: look at gene scores for
68:03 groups of genes and
68:05 do statistical computations. and once we have
68:09 done that, we decide, okay:
68:12 for this set of genes we fulfill the condition
68:18 that each of these clusters
68:26 has, or represents, a certain biological function,
68:30 because we found a gene in the literature
68:33 that corresponds to a certain cell type in the body
68:37 and that is expressed in one of the clusters but not
68:39 in the others.
68:41 that is very important,
68:45 and then we can give these clusters
68:47 names: for example
68:49 radial glia cells,
68:52 oligodendrocyte
68:56 precursor cells, and so on, and neurons.
68:59 and then you find, okay, these orange ones here are the neurons,
69:02 and over here were the stem cells,
69:06 and then you can start thinking: okay, these stem cells somehow
69:09 turn into these neurons; they mature,
69:13 they get more and more mature, and at the end they turn into these
69:17 neurons here.
69:18 and then we have other cell types in the brain, like microglia
69:23 and so on, that we can also find here.
69:27 now remember that the UMAP
69:31 keeps the global topology largely intact.
69:38 that means that cells that are close here are
69:42 actually also very similar
69:45 in the high-dimensional space,
69:49 so it is tempting to think about these
69:52 paths as a trajectory that cells take while they go
69:58 from stem cells
70:00 into neurons in the brain.
70:04 so let's go on. there are a lot of consistency checks:
70:08 you have to check all kinds of genes,
70:11 a lot of them, and discuss a lot
70:14 with the people who actually know,
70:16 who do nothing else in their lives but look at these cells and these brains,
70:20 and who know all of these genes and all of the papers on these genes.
70:23 and you do more fancy stuff as well.
70:28 then you can compare the
70:32 different clusterings, and now we have an
70:36 identification. this one, i said, is the one
70:40 that we can live with, with these eight
70:42 clusters,
70:43 while in the higher-resolution clusterings we have
70:47 several clusters representing
70:49 the same cell type.
70:51 we can in principle come back later
70:53 to these finer clusterings,
70:56 and typically biologists want you to get as many clusters
71:02 as you can.
71:05 so now we have these classes, and we
71:08 can have some measure of how good these clusters actually are.
71:12 there are specific plots for this;
71:15 for example, these are called dot plots.
71:19 what these plots show is: on the x-axis
71:23 are gene names, and on the y-axis
71:27 are the cluster names,
71:31 the color represents how strongly
71:34 a gene is expressed on average in a
71:38 cluster,
71:40 and the size of each dot tells you
71:44 what fraction of the cells in that cluster
71:47 have this gene switched on.
71:49 and if we now go
71:53 here,
71:54 what we want to see is that each of these genes
71:58 is on in only one cluster but not in the
72:01 other clusters.
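In scanpy a dot plot like this is a single call; the marker-gene dictionary below is a made-up placeholder.

```python
# Dot plot: color = mean expression in a cluster, dot size = fraction
# of cells in the cluster expressing the gene. Gene names are placeholders.
import scanpy as sc

markers = {
    "stem cells": ["marker_a", "marker_b"],
    "neurons":    ["marker_c", "marker_d"],
}
sc.pl.dotplot(adata, var_names=markers, groupby="leiden_0.3")
```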
72:02 a good example is this one
72:05 here,
72:06 this OC cluster:
72:10 there is only one cluster
72:13 in which these genes that we have
72:17 identified for it show up. such genes are called
72:21 marker genes, and there are computational tools to detect them.
72:25 you find them only in this
72:28 cluster but not in the other clusters. the same holds for
72:32 this MG cluster, the microglia:
72:34 we find these genes only in this one cluster and not in any other
72:39 cluster.
72:40 whereas if we go to a messy one, the
72:44 OPCs,
72:45 or to these neurons here,
72:49 then it is more messy and not so
72:53 clearly defined.
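The "computational tools" for marker detection are differential-expression tests between each cluster and the rest; one possible sketch in scanpy:

```python
# Marker-gene detection: test each cluster against all other cells and
# rank genes by how specifically they are expressed in that cluster.
import scanpy as sc

sc.tl.rank_genes_groups(adata, groupby="leiden_0.3", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=10)   # top candidates per cluster
```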
72:54 so if we now go back, for example,
72:57 and look at these plots: the
73:01 MGs, the microglia, were very clean
73:04 in the plot that i just showed you,
73:07 and that is also reflected in this new
73:10 UMAP: they are a different cell type
73:14 that is presumably not produced by the
73:17 same stem
73:18 cells as the neurons.
73:21 the same for these OCs, the oligodendrocytes
73:25 i think:
73:26 they are separated from the rest, there is no overlap here,
73:30 they are distinct cell types.
73:33 whereas for others we had an
73:35 overlap, between these mature neurons
73:38 and the cluster that is called six
73:40 here:
73:41 there we found that
73:44 there is an actual overlap in the markers, they express the same genes,
73:48 and that probably
73:50 means that this cluster six
73:52 is an artifact that we cannot take
73:55 too seriously.
73:56 on the right-hand side you can see that in the next step
74:01 we went a little broader and merged this cluster six
74:04 into the interneurons here.
74:08 now, of course, i showed you
74:10 basically a finalized version.
74:12 normally you go back and forth between these steps and the earlier steps
74:16 again and again, all the way back to the quality control,
74:19 until there is no trace
74:22 of experimental, technical parameters
74:25 left in your plots. and then you also go back and forth
74:29 between these clusterings, your
74:31 experimental friends
74:32 who do the experiments, and the
74:35 literature,
74:35 until you find something that really
74:40 makes biological sense.
74:43 it is not the case that these techniques automatically give you
74:47 something, as if there were a
74:50 mathematical criterion
74:53 that would tell you what makes sense;
74:55 you always have to do that
74:56 yourself.
74:59 it is not that you push a button and then everything
75:02 works automatically, and that is why people who do this kind of
75:06 analysis are very much sought after
75:09 on the job market.
75:12 okay, so let me just see if there is
75:16 anything else interesting. you can go on and on forever, you
75:20 know: you can give the experimentalists lists
75:22 of which genes are on and off
75:25 in which cells,
75:26 and once the experimentalists have these
75:29 lists they can do more
75:30 experiments; they can create, for
75:33 example,
75:34 new animals that lack these genes.
75:37 what i want to show you
75:40 is now something near the bottom;
75:44 you can see that the analysis is very
75:46 lengthy.
75:50 there are of course many aspects here that matter mainly for the biology.
75:54 i just want to show you:
76:01 a lot of stuff, a lot of stuff; that is
76:04 what the biologists are interested in,
76:06 right, all of these genes,
76:09 more stuff, even more stuff,
76:20 and also a lot of calculations,
76:24 more heat maps to check
76:28 which genes are on where, and so on. these are all
76:32 consistency checks.
76:35 and one thing i want to show you
76:38 is something, let me see if we have it,
76:43 okay, something that is called
76:46 trajectory inference.
76:48 these are cells from a brain,
76:52 and what cells do is they divide and produce other cells,
76:57 and then they get more specialized over
76:59 time:
77:00 these cells start as stem cells, and then
77:04 they mature until at some point
77:07 they are neurons. so what we did here,
77:11 and what you do in
77:12 many cases, is say:
77:14 okay, i have snapshot data. these
77:17 fish were killed for the
77:20 measurement,
77:21 so i don't have a time course or anything,
77:23 just one snapshot measurement,
77:28 but i have different cells that are at
77:30 different stages of this dynamic process
77:33 of cell maturation. so what people then
77:36 ask is: can we get the temporal
77:39 information back?
77:40 this is trajectory
77:42 inference: working out how
77:44 these different clusters and cell types
77:46 relate to each other,
77:48 from which you can then calculate rates,
77:51 the flux from one cell type
77:57 to another cell type.
77:59 and in principle, based on this, you
78:01 can then build stochastic
78:03 models that you can also
78:05 compare to
78:06 other experiments or to theoretical
78:09 work.
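The lecture does not name the trajectory-inference tool used; one common recipe in the scanpy ecosystem is PAGA for the coarse cluster-to-cluster graph plus diffusion pseudotime for an ordering of cells, sketched here with the root cluster chosen by hand as an assumption.

```python
# One common trajectory-inference recipe: PAGA gives a coarse graph of
# how clusters connect; diffusion pseudotime orders cells along it.
import numpy as np
import scanpy as sc

sc.tl.paga(adata, groups="leiden_0.3")
sc.pl.paga(adata)                       # abstracted cluster-to-cluster graph

# pick a root cell inside the presumed stem-cell cluster (assumed "0" here)
adata.uns["iroot"] = int(np.flatnonzero(adata.obs["leiden_0.3"] == "0")[0])
sc.tl.diffmap(adata)
sc.tl.dpt(adata)                        # adds obs["dpt_pseudotime"]
sc.pl.umap(adata, color="dpt_pseudotime")
```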
78:10 that is what i wanted to show you
78:13 at the end.
78:13 let me just go through and see if
78:15 there is anything else worth
78:17 showing you:
78:18 a lot of stuff. you can then
78:21 compare to humans and other animals
78:24 and see where there are similarities.
78:27 these fish that we are looking at
78:29 are very interesting because they can
78:31 regenerate their brain, they can build
78:33 new neurons,
78:34 which is something we cannot do. so there is a lot
78:36 that we want
78:38 to learn: what are the
78:39 similarities and differences, and why are we not
78:41 able to do that as humans?
78:45 and with that
78:48 we are already at the end of
78:51 this very lengthy analysis.
78:53 so this is a typical data science
78:56 project
78:57 from start to end, and you can see
79:00 here a mixture of
79:01 R and python.
79:05 the important thing is that, despite what we showed you
79:09 in the last lectures, you cannot just go and take the data and
79:12 throw a UMAP on it or do some machine
79:15 learning on it.
79:16 a large part of
79:20 this pipeline is to actually
79:24 clean up the data and think about which part of the data makes sense
79:29 and which part of the data
79:31 does not make sense.
79:33 for example, think about the new york
79:34 city flights data:
79:36 does it make sense to have a negative
79:37 departure delay? that was a
79:41 very good question,
79:42 and it actually does make sense, a negative delay simply means the flight
79:44 left early. that is a data set
79:46 that is used a lot for teaching, so it has
79:48 already been cleaned up a lot, but
79:50 typically you should expect
79:51 a lot of nonsensical
79:54 measurements
79:55 in your data. sometimes you
79:57 have a departure delay of
79:59 10 billion years or so; that is the sort of thing you see
80:02 in real data when
80:04 somebody made a typo somewhere.
80:06 then you have that in your data set, and
80:09 you have to filter it out again. this
80:11 happens all the time. if you
80:13 are not looking out for this,
80:16 if you are not taking care of this,
80:18 then all of these nice plots that i
80:20 showed you
80:21 won't work. they only work
80:23 because we clean up the data,
80:25 we normalize the data, we transform the
80:28 data in a statistical way to make it
80:31 nicely behaved,
80:34 with no huge outliers or so, and then these
80:37 methods
80:38 like UMAP and so on work on the data.
80:42 but you always have to do these
80:44 pre-processing steps
80:46 first to make that work.
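As a tiny illustration of that kind of filter, here is a hedged pandas sketch; the column name follows the nycflights13 convention (`dep_delay` in minutes), and the file path and cutoffs are arbitrary assumptions.

```python
# Toy cleaning step for a flights table. Column names follow the
# nycflights13 convention; the bounds below are illustrative guesses.
import pandas as pd

flights = pd.read_csv("flights.csv")   # assumed local copy of the data

# keep rows with a recorded delay; a negative delay (early departure)
# is legitimate, but delays of "10 billion years" are typos to drop
clean = flights.dropna(subset=["dep_delay"])
clean = clean[clean["dep_delay"].between(-60, 24 * 60)]
```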
80:47 and the next step: you have these fancy
80:49 methods,
80:50 but taken alone they don't make sense.
80:53 you can see here how this method has
80:56 ordered these
80:56 trajectories: cells move along here in
80:59 time,
81:00 move along this line and turn into
81:02 neurons here,
81:04 and of course we have many different
81:05 neurons in the brain.
81:07 but to understand what is
81:09 happening here, to make sense
81:11 of this data,
81:12 you have to come up with hypotheses,
81:15 you have to connect these hypotheses
81:17 with what is already out there in the
81:18 literature, and then step by step you can
81:21 construct
81:22 an understanding of what the degrees of freedom
81:26 actually are in this data set.
81:29 so this is an example, and it
81:32 is an iterative process that you improve
81:36 over time. this is an example of
81:39 a purely data science project,
81:43 with very little physics in it,
81:45 and next time we will show you how all of
81:47 this connects to something that we
81:49 actually
81:51 did in the first part of this lecture,
81:53 namely field theory,
81:54 phase transitions, criticality, and so on.
81:58 okay, great. i'll stay online in
82:01 case there are any questions; otherwise,
82:03 see you next week.
82:08 bye
82:21 thank you, it was very interesting. so when are you going

