WEBVTT

00:00:06.000 --> 00:00:07.300
Hi, I'm Sharon.

00:00:07.300 --> 00:00:11.400
Machlis at idg communications here with episode 50 of do more

00:00:11.400 --> 00:00:13.200
with our five things.

00:00:13.200 --> 00:00:16.700
You may not know about datatables every function.

00:00:16.700 --> 00:00:22.000
F read Imports external data files into our if it's a data.

00:00:22.000 --> 00:00:25.100
Table function, you know, it's fast very fast.

00:00:25.100 --> 00:00:29.500
But this morning every then speed it has several helpful

00:00:29.500 --> 00:00:33.500
option. You might not know about or maybe you know about them

00:00:33.500 --> 00:00:36.200
once but since stop using them, let's take a look.

00:00:37.200 --> 00:00:41.300
We'll start with a file of daily covid-19 data from every County

00:00:41.300 --> 00:00:45.200
in the US a 12 GB file from The New York Times on GitHub.

00:00:45.200 --> 00:00:48.900
If you'd like to follow along download the file from the URL

00:00:48.900 --> 00:00:49.600
on screen.

00:00:56.800 --> 00:00:59.200
Tip one is your file a large.

00:00:59.200 --> 00:01:02.200
Would you like to examine its structure before importing

00:01:02.200 --> 00:01:05.500
the whole thing without having to open it in the text editor

00:01:05.500 --> 00:01:09.200
or Excel use f reads and Rose auction?

00:01:09.600 --> 00:01:13.100
Is a line of coke creating my dt10.

00:01:13.100 --> 00:01:17.800
I'm importing just the first 10 rows of a CSV the first let me

00:01:17.800 --> 00:01:22.000
load the data table package and create my DTA 10.

00:01:25.100 --> 00:01:26.500
Now we take a look at it.

00:01:26.500 --> 00:01:28.600
We can see all the columns.

00:01:29.500 --> 00:01:33.000
And we can look at it in data Studio as well.

00:01:37.500 --> 00:01:40.600
If you just want to see column names without any data at all,

00:01:40.600 --> 00:01:42.900
you can use En Rose equal 0.

00:01:46.100 --> 00:01:47.900
And here in the bottom left pain,

00:01:47.900 --> 00:01:49.900
you can see the car names.

00:01:53.600 --> 00:01:57.100
Tip to once you know the file structure you can choose which

00:01:57.100 --> 00:02:01.100
columns to import F read select option lets you pick calms

00:02:01.100 --> 00:02:04.100
you want to keep and they'll appear in your data.

00:02:04.100 --> 00:02:08.300
Table in the same order as you name Sam select takes a vector

00:02:08.300 --> 00:02:12.700
or either call him names or, position numbers if names they

00:02:12.700 --> 00:02:15.800
need to be in quotation marks like most directors of character

00:02:15.800 --> 00:02:18.200
strings as you see in this first line of code.

00:02:23.600 --> 00:02:26.600
As always numbers don't need quotation marks.

00:02:28.900 --> 00:02:33.800
And you can see I've got a data table with 300 22651 Rose.

00:02:35.700 --> 00:02:39.700
You can use an object with a vector of column names inside

00:02:39.700 --> 00:02:42.900
every as you can see in the second group of code first.

00:02:42.900 --> 00:02:47.800
I create a vector my underscore calls with date County State

00:02:47.800 --> 00:02:51.000
and cases that I use that Vector inside their fries.

00:02:53.600 --> 00:02:54.800
It works the same.

00:02:57.000 --> 00:03:01.300
The opposite of Select is drop you can choose to import all columns

00:03:01.300 --> 00:03:04.600
except the ones you specify with drop as in this third group

00:03:04.600 --> 00:03:05.200
of code.

00:03:06.800 --> 00:03:09.600
Here, I want everything but dips and deaths.

00:03:11.700 --> 00:03:14.000
This works with column numbers as well.

00:03:17.800 --> 00:03:22.200
Tip number 3 if you familiar with Unix you can combine are free

00:03:22.200 --> 00:03:25.300
with command line tools right from inside F read,

00:03:25.300 --> 00:03:28.600
for example, if I just want to California data,

00:03:28.600 --> 00:03:32.800
I could use grep to only import lines that contain the text

00:03:32.800 --> 00:03:37.900
California note that this search is each entire row as a single

00:03:37.900 --> 00:03:40.500
text string not a specific column.

00:03:40.500 --> 00:03:44.400
So your data has to be a format where this makes sense in this

00:03:44.400 --> 00:03:45.500
first line of code.

00:03:45.500 --> 00:03:47.500
You see the grep command to find the expression,

00:03:47.500 --> 00:03:49.900
California in the US counties file.

00:03:53.100 --> 00:03:54.400
Let's see what this looks like.

00:03:58.700 --> 00:04:02.100
Unfortunately grip doesn't understand the original files,

00:04:02.100 --> 00:04:04.500
names. So you wind up with the fault names,

00:04:04.500 --> 00:04:10.000
but if we'd like to specify calling names also with the called.

00:04:10.000 --> 00:04:13.900
Names are I can put the names back using the names for my my

00:04:13.900 --> 00:04:18.100
dt10 data table that small data table that created with just

00:04:18.100 --> 00:04:18.900
10 rows of data.

00:04:20.700 --> 00:04:22.900
You can see that here.

00:04:29.800 --> 00:04:31.600
Now I thought my calling names back.

00:04:35.300 --> 00:04:39.200
This is an example of using regular expressions with grep stash

00:04:39.200 --> 00:04:43.000
option letting us do more complex searches such as looking

00:04:43.000 --> 00:04:44.000
for four states at once.

00:04:50.200 --> 00:04:53.200
you see I got that day to hear now in States for

00:04:55.200 --> 00:04:59.200
But once again a reminder this is looking for each of those

00:04:59.200 --> 00:05:01.400
fake names anywhere in the world.

00:05:01.400 --> 00:05:05.600
Not just in the steak on if I check what states are included

00:05:05.600 --> 00:05:09.500
in my results looking at unique values for the state column.

00:05:10.500 --> 00:05:14.900
You can see I got Oklahoma and Missouri in there in addition

00:05:14.900 --> 00:05:16.400
to the four states I asked for.

00:05:20.700 --> 00:05:23.400
I happen to know that's because both Oklahoma and Missouri

00:05:23.400 --> 00:05:25.400
have counties named, Texas.

00:05:25.400 --> 00:05:28.600
So this is a way to filter out a lot of data,

00:05:28.600 --> 00:05:31.200
you know, you don't want from a very large dataset,

00:05:31.200 --> 00:05:34.600
but after this kind of important you'll still want a filter

00:05:34.600 --> 00:05:39.100
specifically bicom afterwards to make sure you get only the data

00:05:39.100 --> 00:05:41.600
you want that's a file group of cohere.

00:05:43.500 --> 00:05:49.600
I'm using optimized for Speed chin operator like Besar as in but

00:05:49.600 --> 00:05:50.100
faster.

00:05:53.100 --> 00:05:56.600
Select rows where the state column is any of those four states

00:05:56.600 --> 00:05:59.900
not just anywhere now if I check what states are in my data?

00:06:01.400 --> 00:06:02.800
It's only the ones I wanted.

00:06:03.800 --> 00:06:08.800
Good for you can set column classes during import for just a few

00:06:08.800 --> 00:06:12.800
columns. You don't have to name everyone for example that they

00:06:12.800 --> 00:06:16.400
call him in the state is coming in as character strings even

00:06:16.400 --> 00:06:18.200
though it's in a year month day format.

00:06:19.300 --> 00:06:21.200
As you can see here.

00:06:24.800 --> 00:06:29.300
We can set the column name date that's lowercase D to the datatype

00:06:29.300 --> 00:06:33.700
date. That's capital D in the import using the call classes

00:06:33.700 --> 00:06:34.200
option.

00:06:35.700 --> 00:06:40.900
Here. I'm sitting lowercase date, name uppercase date data type

00:06:40.900 --> 00:06:41.600
date.

00:06:50.000 --> 00:06:51.500
Now dates should be dates.

00:06:53.100 --> 00:06:56.300
And you can see in the bottom left panel dates are indeed

00:06:56.300 --> 00:06:56.700
fixed.

00:07:00.800 --> 00:07:05.500
And tip number 5 you can import a zip file without unzipping a

00:07:05.500 --> 00:07:08.100
first every can import.

00:07:08.100 --> 00:07:11.900
Jeezy and. Be the two files directly here.

00:07:11.900 --> 00:07:13.600
I'm importing a local gzip file.

00:07:14.700 --> 00:07:15.700
That's what the funny.

00:07:18.000 --> 00:07:22.600
And you see I brought my data if you need to report a.

00:07:22.600 --> 00:07:26.100
Zip file, you can unzip it with the unzip system Command

00:07:26.100 --> 00:07:30.300
right with an F read like in the second line of code using

00:07:30.300 --> 00:07:32.600
the system command unzip with a couple of options.

00:07:36.900 --> 00:07:38.100
And that's it for this episode.

00:07:38.100 --> 00:07:39.800
Thanks for watching.

00:07:39.800 --> 00:07:43.600
Some more are tips had to do more with our page at bit.

00:07:43.600 --> 00:07:48.300
Ly / do more with our all lowercase except for the are you can

00:07:48.300 --> 00:07:52.600
also find to do more with our playlist on the YouTube idg techtalk

00:07:52.600 --> 00:07:55.300
channel where you can subscribe so you never miss an episode.

00:07:55.300 --> 00:07:57.000
Hope to see you next time.

00:07:57.000 --> 00:07:58.700
Stay healthy and safe everyone.