40 Practices on Stock Data Processing

3. 40 Practices on Stock Data Processing#

3.1. Load Packages#

library(data.table)  # the best data manipulation package
library(stringr)  # process strings
library(magrittr)  # pipe operator, e.g., "%>%"
library(lubridate)  # for date and time processing

About the Code Comments

For instruction purpose, I tried to provide as many comments as possible, which may render the code unnecessarily long and scary. For example (Question 6):

data[
    # filter stocks
    retx>=0.05 | retx<=-0.05,
    ][, 
    # create `direction` to indicate price change direction
    ':='(direction=ifelse(retx>=0.05, 'up5%+', 'down5%-'))
    ][, 
    # count the stocks
    .(num_stk=uniqueN(permno)),
    keyby=.(date, direction)
    ][1:5]

However, if you take out the comments and rearrange the code, it’ll be much more friendly:

data[retx>=0.05 | retx<=-0.05,
    ][, ':='(direction=ifelse(retx>=0.05, 'up5%+', 'down5%-'))
    ][, .(num_stk=uniqueN(permno)),
    keyby=.(date, direction)
    ][1:5]

3.2. Inspect the Data#

We select the price and trading data of all US common stocks from the CRSP database between 2023-01-01 and 2023-06-30. Each row represents a stock’s daily observation. There’re in total 4221 unique stocks with about 465k rows.

Observation vs. Row

In my book, I use “observation” and “row” interchangeably, which is common in the field of data science.

As a related note, “field,” “column,” “variable,” and “feature” are also interchangeably to indicate a column of a table. While statisticians prefer “variable”, data scientists use “feature” more often. “Field” and “column” are often seen in the context of database.

# read data
data = readRDS("data/crsp.rds")

# convert to data.table
setDT(data)  

# print the first 3 rows
data[1:3]

A data.table: 3 × 14
permno	date	ticker	open	high	low	close	cap	dlyfacprc	vol	prcvol	retx	industry	is_sp500
<dbl>	<date>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>	<lgl>
11790	2023-01-03	ALCO	23.91	24.02	23.42	23.83	180941.2	1	115286	2747265	-0.001675744	Agriculture, Forestry, Fishing and Hunting	FALSE
11790	2023-01-04	ALCO	23.81	24.37	23.68	23.99	182156.1	1	46646	1119038	0.006714226	Agriculture, Forestry, Fishing and Hunting	FALSE
11790	2023-01-05	ALCO	23.75	24.06	23.53	23.79	180637.5	1	61673	1467201	-0.008336807	Agriculture, Forestry, Fishing and Hunting	FALSE

This is “panel datak,” meaning it contains multiple stocks (cross-sectional) with each stock has multiple observations ordered by date (time-series).
Each combination of permno (stock ID) and date determines a unique observation, i.e., they’re the key.
The dataset is first sorted by permno, and then date
Column definition:
- permno: The unique ID provided by CRSP.
- date: Date
- ticker: Ticker. Note that the ticker of a stock can change, and hence it’s not recommended to use ticker as stock ID.
- industry: The industry classification determined using the first two digits of NAICS industry classification code. See the official document for more details.
- is_sp500: TRUE if the stock is in the S&P 500 list by the mid 2023; FALSE if not.
- open: Daily open
- high: Daily high
- low: Daily low
- cap: Daily market capital (in $1000)
- dlyfacprc: Daily factor to adjust for price. (We won’t use it in the practice.)
- vol: Trading volume (number of shares traded)
- prcvol: Dollar trading volume, calculated as the number of shares traded multiplied by the price per share.
- retx: Daily return (excluding dividend).

Warning

Without specifying otherwise, we mean prcvol (dollar trading volume) whenever we say trading volume. The reason is simple: only dollar trading volume are comparable across different stocks.

3.3. Find out all stocks whose ticker contains `"COM"`#

We use the str_detect function (from the stringr package) to find out all ticksers that have a "COM". But that’s only half the way. Without deduplicating (the unique() function), we will end up with numerous duplicate observations whose permno and ticker are the same but only differ in date.

We provide no arguments to unique(), meaning R needs to consider all columns.

data[str_detect(ticker, 'COM'), .(permno, ticker)
    ] %>% unique()

A data.table: 6 × 2
permno	ticker
<dbl>	<chr>
77178	QCOM
20421	COMS
14239	COMM
19136	MCOM
92716	DCOM
20916	COMP

Execution explained:

First, R executes str_detect(ticker, 'COM') to filter out the needed rows.
Then, it only keeps two columns, .(permno, ticker). By then, we have many duplicates.
Lastly, we use unique() to remove the duplicates.

3.4. How many stocks on each day have positive/negative returns?#

Since we want to count the stocks on each day, We need to first group all stocks by date and then by the return sign (positive vs. negative). Next, we need to generate two counts: one for the up (positive return) and the other for the down (negative). We leverage the sign of retx (return) to determine the direction of price change.

# step 1: create a new column "direction" which indicates the direction of the price change
data_q2 = data[, ':='(direction = sign(retx))]
data_q2 = data_q2[!is.na(direction)]

# step 2: count the stocks with +/- returns on each day
data_q2[, .(num=uniqueN(permno)), keyby=.(date, direction)][1:5]

A data.table: 5 × 3
date	direction	num
<date>	<dbl>	<int>
2023-01-03	-1	1843
2023-01-03	0	97
2023-01-03	1	1834
2023-01-04	-1	993
2023-01-04	0	103

Output Explained:

On each day we have three observations (rows), one for each direction (1, 0, and -1) of the price change. For example, on 2023-01-03, there’re 1844 stocks with positive return.

Execution Explained:

In Step 1, we create a new column direction to indicate whether we’re counting the up/down/no-move stocks for each day. Its value is 1, 0, or -1, depending on the sign of retx.
In Step 2, I group all observations by date (since we want statistics for each day) and then direction. uniqueN() is used to count the unique stock IDs within each group.

However, we don’t need to split the code explicitly. One great thing about data.table is that it can chain multiple computation steps together using [. In the following code, I put the creation of direction in the key argument.

# chain the two steps together (same results)
data[, .(num=uniqueN(permno)),
    keyby=.(date, direction=sign(retx))
    ][!is.na(direction)][1:5]

A data.table: 5 × 3
date	direction	num
<date>	<dbl>	<int>
2023-01-03	-1	1843
2023-01-03	0	97
2023-01-03	1	1834
2023-01-04	-1	993
2023-01-04	0	103

You’re encourage to chain the steps!

As you can see, the chained version saves a lot of key strokes for you. Personally, I usually chain dozens of lines of code when I prototype. Once you get familar with data.table, the chained brackets will not look confusing but elegant and straightforward to you.

3.5. How many stocks on each day have positive/negative returns (only select stocks with market capital larger than $100M)?#

Q3 is very similar to Q2, the only change is to use cap to filter stocks by market capital. Note that cap is in 1000 dollars.

# we only add the condition "cap>=1e5" to the first step
# the rest is the same

data[cap>=1e5, 
    .(num=uniqueN(permno)),
    keyby=.(date, direction=sign(retx))
    ][!is.na(direction)
    ][1:5]

A data.table: 5 × 3
date	direction	num
<date>	<dbl>	<int>
2023-01-03	-1	1577
2023-01-03	0	72
2023-01-03	1	1454
2023-01-04	-1	806
2023-01-04	0	68

3.6. How many stocks are there in each industry on each day?#

data[, .(num=uniqueN(permno)),
    keyby=.(date, industry)
    ][1:3]

A data.table: 3 × 3
date	industry	num
<date>	<chr>	<int>
2023-01-03	Accommodation and Food Services	64
2023-01-03	Administrative and Support and Waste Management and Remediation Services	190
2023-01-03	Agriculture, Forestry, Fishing and Hunting	9

Execution Explained:

We first group by date and industry, then we count the unique stock IDs within each gruop using uniqueN()

3.7. Rank industries by their average daily number of stocks#

This question is very similar to the previous one. But instead of counting stock numbers for each day, we will generate an average daily stock number through out the whole sample period and use that to rank the industries.

data[, 
    # step 1: count the number of stocks in each industry on each day
    .(num_stk=uniqueN(permno)), keyby=.(industry, date)
    # step 2: get the average number of stocks in each industry
    ][, .(num_stk=round(mean(num_stk))), keyby=.(industry)
    # step 3: sort the industries descendingly by the average number of stocks
    ][order(-num_stk)
    ][1:3]

A data.table: 3 × 2
industry	num_stk
<chr>	<dbl>
Manufacturing	1276
Finance and Insurance	781
Professional, Scientific, and Technical Services	299

Execution Explained

We first count the number of stocks for each industry on each day, as in the previous problem.
We then compute the average stock number across the sample period. Note this time we group by industry instead of industry and date, this is because we want to average across the whole sample period.
order(-num_stk) sort the results. the minus sign - indicates sorting desendingly. Without - (the default) the sort is ascending.

3.8. On each day, is the industry with the most stocks also generates the largest trading volume?#

Some industries contain a lot of stocks, but it may not generate the largest trading volume (in dollars) because the component stocks may be very small. Let’s check if this claim is correct or not.

# for each industry, compute the total trading volume and stock number on each day
# we use `prcvol` as the trading volume
data_tmp = data[, 
    .(prcvol=sum(prcvol), num_stk=uniqueN(permno)), 
    keyby=.(date, industry)]
    
data_tmp[1:2]

A data.table: 2 × 4
date	industry	prcvol	num_stk
<date>	<chr>	<dbl>	<int>
2023-01-03	Accommodation and Food Services	4899714547	64
2023-01-03	Administrative and Support and Waste Management and Remediation Services	8977070704	190

Output Explained:

prcvol: Total price trading volume
num_stk: number of stocks in this industry.

# for each day and each industry, we create two new columns:
# - `is_max_num_stk` indicates if this industry has the largest number of stocks
# - `is_max_prcvol` indicates if this industry has the largest trading volume

data_tmp = data_tmp[, 
    ':='(is_max_num_stk = num_stk == max(num_stk),
         is_max_prcvol = prcvol == max(prcvol)),
    keyby=.(date)]
    
data_tmp[1:2]

A data.table: 2 × 6
date	industry	prcvol	num_stk	is_max_num_stk	is_max_prcvol
<date>	<chr>	<dbl>	<int>	<lgl>	<lgl>
2023-01-03	Accommodation and Food Services	4899714547	64	FALSE	FALSE
2023-01-03	Administrative and Support and Waste Management and Remediation Services	8977070704	190	FALSE	FALSE

# for each day, we select the industry with the largest number of stocks,
# and `is_max_prcvol` will tell us if this industry has the largest trading volume
data_tmp[is_max_num_stk==T][1:3]

A data.table: 3 × 6
date	industry	prcvol	num_stk	is_max_num_stk	is_max_prcvol
<date>	<chr>	<dbl>	<int>	<lgl>	<lgl>
2023-01-03	Manufacturing	120272008757	1271	TRUE	TRUE
2023-01-04	Manufacturing	122987139599	1275	TRUE	TRUE
2023-01-05	Manufacturing	113584385443	1271	TRUE	TRUE

Output Explained:

Each day only has one observation in the final output. It tells us what’s the industry with the largest number of stocks (is_max_num_stk==T), and is_max_prcvol tells us if this industry generates the largest trading volume.
As you can see, “Manufacturing” is the industry with the most stocks, and it generates the largest price trading volume.

Chaining everything together, we can rewrite the above code as:

data[, 
    # compute total trading volume and stock number 
    .(prcvol=sum(prcvol), num_stk=uniqueN(permno)), 
    keyby=.(date, industry)
    ][,
    # add indicators
    ':='(is_max_num_stk = num_stk == max(num_stk),
         is_max_prcvol = prcvol == max(prcvol)),
     keyby=.(date)
    ][
    # only keep the industry with the largest number of stocks
    is_max_num_stk==T][1:3]

A data.table: 3 × 6
date	industry	prcvol	num_stk	is_max_num_stk	is_max_prcvol
<date>	<chr>	<dbl>	<int>	<lgl>	<lgl>
2023-01-03	Manufacturing	120272008757	1271	TRUE	TRUE
2023-01-04	Manufacturing	122987139599	1275	TRUE	TRUE
2023-01-05	Manufacturing	113584385443	1271	TRUE	TRUE

3.9. How many stocks on each day have a gain of more than 5% and a loss more than 5%?#

Similar to Q2, the key is to group the stocks by date and the direction of change.

data[
    # filter stocks
    retx>=0.05 | retx<=-0.05,
    ][, 
    # create `direction` to indicate price change direction
    ':='(direction=ifelse(retx>=0.05, 'up5%+', 'down5%-'))
    ][, 
    # count the stocks
    .(num_stk=uniqueN(permno)),
    keyby=.(date, direction)
    ][1:5]

A data.table: 5 × 3
date	direction	num_stk
<date>	<chr>	<int>
2023-01-03	down5%-	260
2023-01-03	up5%+	250
2023-01-04	down5%-	86
2023-01-04	up5%+	499
2023-01-05	down5%-	313

Execution Explained:

retx>=0.05 | retx<=-0.05: Only select observations with the required retx range. This will reduce the data size for the next step, and hence speed up the execution.
':='(direction=ifelse(retx>=0.05, 'up5%+', 'down5%-')): We use ifelse to indicate the direction: if retx>=0.05 is met, then direction is “up5%+”, otherwise, the value is “down5%-“.

3.10. What are the daily total trading volume of the top 10 stocks with the largest daily increase and the top 10 stocks with the largest daily decrease?#

data[
    # sort by date and retx
    order(date, retx)
    ][, 
    # compute the total trading volumn of the top/bottom 10
    .(bottom10 = sum(prcvol[1:10]), top10 = sum(prcvol[(.N-9):.N])),
    keyby=.(date)
    ][1:5]

A data.table: 5 × 3
date	bottom10	top10
<date>	<dbl>	<dbl>
2023-01-03	77247969	645893614
2023-01-04	414792118	1239987199
2023-01-05	582416153	257844857
2023-01-06	345656464	502542099
2023-01-09	419735680	1452063611

Execution Explained:

We first sort (ascendingly) by date and retx, this facilitates us to select the top/bottom 10 stocks based on retx.
bottom10 = sum(prcvol[1:10]): This line compute the total trading volume for the bottom 10 stocks.
- prcvol[1:10]: It selects the first 10 elements of the prcvol vector. Remember that previously we’ve sorted by retx, so these 10 observations also has the smallest (i.e., most negative) retx value .
- sum(): Compute the total trading volume for these 10 observations.
top10 = sum(prcvol[(.N-9):.N]): Similar to the above, except that we use a special symbol .N. .N is unique to data.table, and it equals to the number of observations in each group. Since we have group the observations by date (keyby=.(date)), then .N equals to the number of observations on each day.
- prcvol[(.N-9):.N]: It selects the last 10 elements of the prcvol vector. Since within each date, the data is sorted ascendingly by retx, so prcvol[(.N-9):.N] is also the 10 observations with the largest (most positive) retx.

3.11. How many stocks increase by 5% at the time of open and close?#

Sometimes, good news is released to the market over the night when the market is closed. In this case, price can only react to the news when the market opens in the next day. Instead, if the news is released during trading hours, then the market will react to it immediately. Let’s count how many stocks fall in these two categories.

data[, 
    # `pre_close` is the previous day's close price
    ':='(pre_close=shift(close, 1)),
    keyby=.(permno)
    ][,
    # create a new column `ret_open` representing the return at the time of open
    ':='(ret_open = open/pre_close -1)
    ][,
    .(num_open=sum(ret_open>=0.05, na.rm=T),
      num_close=sum(retx>=0.05, na.rm=T)),
    keyby=.(date)
    ][1:5]

A data.table: 5 × 3
date	num_open	num_close
<date>	<int>	<int>
2023-01-03	7	250
2023-01-04	56	499
2023-01-05	35	143
2023-01-06	51	390
2023-01-09	93	350

Output Explained:

num_open: Number of stocks that increase by 5% on the time of open, i.e., (“open price” / “prev day’s close”) - 1 >= 0.05
num_close: Number of stocks that increase by 5% at the time of close, i.e., (“close price” / “prev day’s close”) - 1 >= 0.05

Execution Explained:

':='(pre_close=shift(close, 1)): We use the shift function (from data.table) to create a new column representing the previous day’s close price. 1 means the lag is only one period.
':='(ret_open = open/pre_close -1): ret_open is the return on the time of open.
num_open=sum(ret_open>=0.05, na.rm=T): num_open is the number of stocks that increase by 5% on the time of open. The result of ret_open>=0.05 is a boolean vector (whose elements can only be TRUE or FALSE) with the same length as ret_open. sum(ret_open>=0.05, na.rm=T) will add up all the elements in this boolean vector, where TRUE is treated as 1 and FALSE is treated as 0. Basically, what sum(ret_open>=0.05, na.rm=T) does is to count the number of elements in ret_open that are larger than 0.05. na.rm=T is used to ignore NA.

Tip

How to conditionally count elements in a vector? Suppose we have a vector x = c(1, 2, 3, 4 ,5). How to quickly count the elements in x that are no smaller than 3?

sum(x>=3)

3.12. What’s the correlation coefficient between trading volume and return?#

Does more people trading a stock necessarily push the stock price higher or lower? Let’s check this claim!

# Step 1: compute two versions of volume: vol and vol_change
data_q10 = data[order(permno, date)
    ][, .(vol, vol_change = vol - shift(vol), retx, cap), 
        keyby=.(permno)
    ] %>% na.omit()
    
# Step 2: compute corr
data_q10[, 
    # compute corr for individual stocks
    .(corr1=cor(vol, retx), corr2=cor(vol_change, retx), weight=mean(cap)), 
    keyby=.(permno)
    # compute the weighted average of corr
    ][, .(corr1=weighted.mean(corr1, weight, na.rm=T), 
          corr2=weighted.mean(corr2, weight, na.rm=T))]

A data.table: 1 × 2
corr1	corr2
<dbl>	<dbl>
0.03158209	0.005071771

Output Explained:

Well, the correlation between trading volume and return is very low: from 0.005 to 0.03, depending on how the trading volume is defined.

Execution Explained:

In Step 1, we create two versions of trading volume: vol, which is its raw value, and vol_change, which is the change in vol.
We use na.omit (from data.table) to remove any line that contains an NA.
In Step 2, we first compute the corr for each individual stock, resulting in corr1 and corr2. We also create a new column weight which is the average market capital of each stock. weight will be used when we compute the weighted average of corr coefficients.
We use weighted.mean to compute the weighted average of corr.

3.13. What’s the correlation between trading volume and the magnitude of return?#

From Q10 we’ve learned that trading volume and return is rarely correlated. However, trading volume may still be correlated with the magnitude of return: the more people trade a stock, the larger the price movement (either direction) could be. Let’s verify this hypothesis!

We’ll use focus on individual-stock-level (instead of industry-level) this time.

Instead of using retx (return), we’ll use abs(retx) (the absolute value of retx) to represent its magnitude.

data[, 
    # compute the individual-level correlation, also compute the average market cap
    .(corr=cor(vol, abs(retx)), weight=mean(cap)), keyby=.(permno)
    # compute the weighted average of corr across stocks
    ][, weighted.mean(corr, weight, na.rm=T)]

0.460354725740955

Output Explained:

Wow! Now the correlation is much strong: jump from less than 0.05 to 0.46!

3.14. What’s the correlation between industry-level trading volume and industry-level return?#

Q11 is an extension of Q10. Instead of focusing on individual stocks, we aggregate stocks within the same industry. Will such aggregation improve the correlation?

Again, we use the first two digits of naics as a proxy of industry.

data[, 
    # Step 1: create industry-level volume, return and market cap
    ][, .(ind_vol=sum(vol, na.rm=T), ind_ret=weighted.mean(abs(retx), cap, na.rm=T),
          ind_cap=sum(cap, na.rm=T)), 
    keyby=.(industry, date)
    # Step 2: calculate the correlation for each industry
    ][, .(corr=cor(ind_vol, ind_ret), ind_cap=mean(ind_cap)), keyby=.(industry)
    # Step 3: compute the weighted average of corr
    ][, .(corr=weighted.mean(corr, ind_cap, na.rm=T))]

A data.table: 1 × 1
corr
<dbl>
0.2841868

Output Explained:

Well, the correlation is weaker 😂.

Execution Explained:

In Step 1, we compute the industry-level trading volume, return, and market cap on each day
- ind_vol: Industry volume is defined as the sum of all trading volume within the industry.
- ind_ret: Industry return ind_ret is defined as the value weighted average of all stock returns within the industry. We use market capital cap as the weight.
- ind_cap: The industry’s daily aggregated market capital. We’ll use it in the later stage.
In Step 2, we compute the correlation coefficient for each industry. Note that we also create a new ind_cap defined as ind_cap=mean(ind_cap). It’s the average market capital of the industry.
In Step 3, we compute the weighted average of corr. We use ind_cap calculated in Step 3 as the weight.

3.15. What’s the industry distribution of S&P 500 stocks?#

This problem requires us to count the number of S&P 500 stocks in each industry.

data[
    # select the S&P 500 stocks
    is_sp500==T, 
    # count the number of stocks in each industry
    .(num_stk=uniqueN(permno)), 
    # group by industry
    keyby=.(industry)
    ][
    # sort descendingly by the number of stocks
    order(-num_stk)
    ][1:5]

A data.table: 5 × 2
industry	num_stk
<chr>	<int>
Manufacturing	233
Finance and Insurance	78
Information	42
Retail Trade	41
Utilities	35

3.16. What’s the return volatility of each industry?#

We first compute industry-level return on each day and then computes its volatility.

data[, 
    # Step 1: compute industry return on each day
    .(ind_ret=weighted.mean(retx, cap, na.rm=T)), 
    keyby=.(industry, date)
    ][, 
    # Step 2: compute the standard deviation of industry return
    .(volatility=sd(ind_ret)), keyby=.(industry)
    # sort results
    ][order(-volatility)
    ][1:5]

A data.table: 5 × 2
industry	volatility
<chr>	<dbl>
Public Administration	0.01895026
Real Estate and Rental and Leasing	0.01705378
Mining, Quarrying, and Oil and Gas Extraction	0.01577619
Management of Companies and Enterprises	0.01576541
Agriculture, Forestry, Fishing and Hunting	0.01453754

Execution Explained:

Step 1: We first group the data by industry and date, this allows us to compute the industry return on each day. The industry return is calculated as the weighted mean (weighted.mean) of the component stocks, where the weight is stock’s capital market (cap).
Step 2: We define return volatility as the standard deviation (sd) of the return.

3.17. ⭐️ What’s the correlation matrix of different industries?#

# Step 1: calculate the industry-level return
ret_ind = data[, .(ret_mkt=weighted.mean(retx, cap, na.rm=T)), keyby=.(industry, date)]

# Step 2: pivot the industry return table
ret_ind2 = dcast(ret_ind, date~industry, value.var='ret_mkt')
ret_ind2[, ':='(date=NULL)]  # remove the date column

# Step 3: compute the correlation matrix
cor(ret_ind2)

# -------------------
# The above three steps can be chained together as follows:

# data[, .(ret_mkt=weighted.mean(retx, cap, na.rm=T)), keyby=.(industry, date)
#     ][, dcast(.SD, date~industry, value.var='ret_mkt')
#     ][, ':='(date=NULL)
#     ][, cor(.SD)]

Show code cell output Hide code cell output

A matrix: 20 × 20 of type dbl
	Accommodation and Food Services	Administrative and Support and Waste Management and Remediation Services	Agriculture, Forestry, Fishing and Hunting	Arts, Entertainment, and Recreation	Construction	Educational Services	Finance and Insurance	Health Care and Social Assistance	Information	Management of Companies and Enterprises	Manufacturing	Mining, Quarrying, and Oil and Gas Extraction	Other Services (except Public Administration)	Professional, Scientific, and Technical Services	Public Administration	Real Estate and Rental and Leasing	Retail Trade	Transportation and Warehousing	Utilities	Wholesale Trade
Accommodation and Food Services	1.0000000	0.7823957	0.5013761	0.7446452	0.6078722	0.4654966	0.7066154	0.6463405	0.6122080	0.5470300	0.7633006	0.5211172	0.6187736	0.6538682	0.2292955	0.6230375	0.6628616	0.6214865	0.4618819	0.7677197
Administrative and Support and Waste Management and Remediation Services	0.7823957	1.0000000	0.4870175	0.7344664	0.7252820	0.5375312	0.7737542	0.7066700	0.7159282	0.6638719	0.8540851	0.5505665	0.7166665	0.7893084	0.3274748	0.7056624	0.7397659	0.7403557	0.4451892	0.7896287
Agriculture, Forestry, Fishing and Hunting	0.5013761	0.4870175	1.0000000	0.5675665	0.4633918	0.3434987	0.5934878	0.5175651	0.1669638	0.4734522	0.4207293	0.6089541	0.5641088	0.3328031	0.3866789	0.2557795	0.3079658	0.4584330	0.3518488	0.5721524
Arts, Entertainment, and Recreation	0.7446452	0.7344664	0.5675665	1.0000000	0.6793245	0.5715538	0.7458602	0.6515004	0.5445807	0.6643853	0.6857226	0.5662939	0.7366153	0.6378005	0.3445050	0.5983349	0.5779041	0.6420355	0.3554395	0.7344718
Construction	0.6078722	0.7252820	0.4633918	0.6793245	1.0000000	0.4681835	0.6700813	0.5874457	0.5866864	0.5886178	0.7405226	0.4275301	0.6750095	0.6570817	0.2870273	0.5991846	0.6947419	0.7139696	0.4077543	0.8034275
Educational Services	0.4654966	0.5375312	0.3434987	0.5715538	0.4681835	1.0000000	0.6033540	0.5072563	0.3430916	0.5183743	0.4615438	0.3931933	0.6922868	0.4145928	0.4475607	0.4808919	0.3831186	0.3970923	0.2684934	0.5589631
Finance and Insurance	0.7066154	0.7737542	0.5934878	0.7458602	0.6700813	0.6033540	1.0000000	0.6958050	0.5172975	0.7934317	0.7461947	0.6772370	0.7397326	0.6393567	0.5000995	0.6278522	0.6119276	0.7016956	0.4507855	0.8321860
Health Care and Social Assistance	0.6463405	0.7066700	0.5175651	0.6515004	0.5874457	0.5072563	0.6958050	1.0000000	0.5391532	0.6388265	0.7180290	0.5375165	0.6963002	0.6681555	0.4151832	0.5667816	0.6341223	0.6324254	0.5353252	0.7489169
Information	0.6122080	0.7159282	0.1669638	0.5445807	0.5866864	0.3430916	0.5172975	0.5391532	1.0000000	0.5551209	0.8110040	0.2294112	0.5045470	0.8297543	0.1700053	0.6506830	0.7794881	0.6005063	0.3602825	0.6048700
Management of Companies and Enterprises	0.5470300	0.6638719	0.4734522	0.6643853	0.5886178	0.5183743	0.7934317	0.6388265	0.5551209	1.0000000	0.7038641	0.5196926	0.6502163	0.6242463	0.4819664	0.5861686	0.6148007	0.6186328	0.4452484	0.6842822
Manufacturing	0.7633006	0.8540851	0.4207293	0.6857226	0.7405226	0.4615438	0.7461947	0.7180290	0.8110040	0.7038641	1.0000000	0.4922993	0.6598291	0.8158554	0.3129434	0.7407663	0.7728147	0.7691517	0.5080767	0.8038981
Mining, Quarrying, and Oil and Gas Extraction	0.5211172	0.5505665	0.6089541	0.5662939	0.4275301	0.3931933	0.6772370	0.5375165	0.2294112	0.5196926	0.4922993	1.0000000	0.5161978	0.3692333	0.3907405	0.3370661	0.3208188	0.5122547	0.3717348	0.5829269
Other Services (except Public Administration)	0.6187736	0.7166665	0.5641088	0.7366153	0.6750095	0.6922868	0.7397326	0.6963002	0.5045470	0.6502163	0.6598291	0.5161978	1.0000000	0.6540319	0.4307723	0.5786889	0.5940287	0.6132417	0.4639425	0.7247291
Professional, Scientific, and Technical Services	0.6538682	0.7893084	0.3328031	0.6378005	0.6570817	0.4145928	0.6393567	0.6681555	0.8297543	0.6242463	0.8158554	0.3692333	0.6540319	1.0000000	0.2446779	0.7704328	0.7969950	0.6911916	0.3839157	0.7094917
Public Administration	0.2292955	0.3274748	0.3866789	0.3445050	0.2870273	0.4475607	0.5000995	0.4151832	0.1700053	0.4819664	0.3129434	0.3907405	0.4307723	0.2446779	1.0000000	0.2233009	0.2342094	0.2491409	0.3779179	0.4332633
Real Estate and Rental and Leasing	0.6230375	0.7056624	0.2557795	0.5983349	0.5991846	0.4808919	0.6278522	0.5667816	0.6506830	0.5861686	0.7407663	0.3370661	0.5786889	0.7704328	0.2233009	1.0000000	0.7053585	0.6030830	0.3046936	0.6524080
Retail Trade	0.6628616	0.7397659	0.3079658	0.5779041	0.6947419	0.3831186	0.6119276	0.6341223	0.7794881	0.6148007	0.7728147	0.3208188	0.5940287	0.7969950	0.2342094	0.7053585	1.0000000	0.6410020	0.4584612	0.7412964
Transportation and Warehousing	0.6214865	0.7403557	0.4584330	0.6420355	0.7139696	0.3970923	0.7016956	0.6324254	0.6005063	0.6186328	0.7691517	0.5122547	0.6132417	0.6911916	0.2491409	0.6030830	0.6410020	1.0000000	0.4938618	0.7756485
Utilities	0.4618819	0.4451892	0.3518488	0.3554395	0.4077543	0.2684934	0.4507855	0.5353252	0.3602825	0.4452484	0.5080767	0.3717348	0.4639425	0.3839157	0.3779179	0.3046936	0.4584612	0.4938618	1.0000000	0.5595027
Wholesale Trade	0.7677197	0.7896287	0.5721524	0.7344718	0.8034275	0.5589631	0.8321860	0.7489169	0.6048700	0.6842822	0.8038981	0.5829269	0.7247291	0.7094917	0.4332633	0.6524080	0.7412964	0.7756485	0.5595027	1.0000000

Execution Explained:

Step 1: Nothing new here.
Step 2: We use cor to generate the correlation matrix. You may be wondering what’s dcast doing here. We use dcast because it “pivots” the table so it can satisfy the requirement of the cor function: each column must represent the returns of one industry.

Let’s print out ret_ind (before dcast) and ret_ind2 (after dcast) to get a sense of what’s dcast doing.

# (before `dcast`)
print(ret_ind)  # column `ret_mkt` stores the returns for *all* industries

#>                           industry       date      ret_mkt
#> 1: Accommodation and Food Services 2023-01-03  0.002703254
#> 2: Accommodation and Food Services 2023-01-04  0.018852215
#> 3: Accommodation and Food Services 2023-01-05 -0.008638118
#> 4: Accommodation and Food Services 2023-01-06  0.023222487
#> 5: Accommodation and Food Services 2023-01-09 -0.001595260

# (after `dcast`)
print(ret_ind2)  # now each column only stores the return of one industry

#>    Transportation and Warehousing    Utilities Wholesale Trade
#> 1:                    0.001654837 -0.004928265    -0.003529841
#> 2:                    0.017020174  0.010504283     0.006552128
#> 3:                   -0.013090004 -0.019094989    -0.011908121
#> 4:                    0.034663817  0.020957761     0.026902487
#> 5:                    0.010606304  0.006563325    -0.002127530

As you can see, dcast reshapes the table from a “long” to a “wide” format.

3.18. ⭐️ What’s the largest/smallest stocks in S&P 500 on each day?#

Here we measure the size of a stock by its market capital. This problem involves some advanced techniques of data.table.

data[
    # Step 1: select the S&P 500 stocks
    is_sp500==T, .(date, ticker, cap)
    # Step 2: sort by date and market cap
    ][order(date, cap)
    # Step 3: select the smallest and largest stocks on each day
    ][, c(type=list(c('smallest', 'largest')), 
         .SD[c(1,.N)]), 
    keyby=.(date)
    ][1:6]

A data.table: 6 × 4
date	type	ticker	cap
<date>	<chr>	<chr>	<dbl>
2023-01-03	smallest	MAYS	9.636480e+04
2023-01-03	largest	AAPL	1.981410e+09
2023-01-04	smallest	YELL	1.305505e+05
2023-01-04	largest	AAPL	2.001847e+09
2023-01-05	smallest	MAYS	9.376416e+04
2023-01-05	largest	AAPL	1.980618e+09

Output Explained:

Each day (e.g., 2023-01-03) has two rows, one for the smallest stock and the other for the largest. The type column indicate the type.

Execution Explained:

Step 1: We select S&P 500 stocks. .(date, ticker, cap) allows us to only include the variable of interests.
Step 2: We sort by date and cap. This is important. Because by doing that, in Step 3 where we group by date, the first observation in each group will be the smallest stock (measured by cap), and the last observation will be the largest stock.
Step 3: We first group by date. .SD[c(1,.N)] will extract the first (1) and the last (.N) row in each group (.N represents the number of rows in the current group). .SD (meaning Subset of a Data table) is a special symbol provided by data.table, and it represents all the data for the current group. Because in Step 2 we already sort by date and cap, .SD[c(1,.N)] corresponds to the smallest and largest stock.
- I use type=list(c('smallest', 'largest')) to add a new column to indicate which row represents the smallest and which row represents the largest. I also use c() to concatenate the newly created type column and the columns returned by .SD.
- You can remove type=list(c('smallest', 'largest')), and the result will be the same, just a bit less reading friendly. See below (see how the type column disappears):

data[
    # Step 1: select the S&P 500 stocks
    is_sp500==T, .(date, ticker, cap)
    # Step 2: sort by date and market cap
    ][order(date, cap)
    # Step 3: select the smallest and largest stocks on each day
    ][, .SD[c(1,.N)], 
    keyby=.(date)
    ][1:6]

A data.table: 6 × 3
date	ticker	cap
<date>	<chr>	<dbl>
2023-01-03	MAYS	9.636480e+04
2023-01-03	AAPL	1.981410e+09
2023-01-04	YELL	1.305505e+05
2023-01-04	AAPL	2.001847e+09
2023-01-05	MAYS	9.376416e+04
2023-01-05	AAPL	1.980618e+09

3.19. What is the ratio of the maximum trading volume to the minimum trading volume in each industry on each day?#

This problem asks us to:

On each day and within each industry, find out the maximum and minimum trading volume.
Compute the ratio.

data[
    # Step 1: sort by date, industry and dollar trading volume
    order(date, industry, prcvol)
    # Step 2: compute ratio for each industry on each day
    ][, .(ratio = prcvol[.N]/prcvol[1]),
    keyby=.(date, industry)
    ][1:5]

A data.table: 5 × 3
date	industry	ratio
<date>	<chr>	<dbl>
2023-01-03	Accommodation and Food Services	98400.5093
2023-01-03	Administrative and Support and Waste Management and Remediation Services	2196617.2703
2023-01-03	Agriculture, Forestry, Fishing and Hunting	828.2466
2023-01-03	Arts, Entertainment, and Recreation	2789.4402
2023-01-03	Construction	12978.0359

Execution Explained:

In Step 1: we sort by date, industry and dollar trading volume. This is important because if we group by date and industry later, the first (last) row in each group will have the smallest (largest) trading volume. This technique is also used in the previous problem, and will be used through this book.
In Step 2: we first group by date and industry, and then compute ratio with ratio = prcvol[.N]/prcvol[1]. As explained earlier, prcvol[.N] (the last row) and prcvol[1] (the first row) correspond to the largest and smallest trading volume.

3.20. What’s the sum of the dollar trading volume for the top 5 stocks in each industry on each day?#

We first find out the top 5 stocks with the largest dollar volume in each industry on each day, then we add them up.

data[
    # Step 1: sort by date, industry and dollar trading volume
    order(date, industry, -prcvol)
    ][, 
    # Step 2: get the total trading volume of the top 5 stocks
    .(prcvol=sum(prcvol[1:5])), keyby=.(date, industry)
    ][1:3]

A data.table: 3 × 3
date	industry	prcvol
<date>	<chr>	<dbl>
2023-01-03	Accommodation and Food Services	2411654851
2023-01-03	Administrative and Support and Waste Management and Remediation Services	4883788253
2023-01-03	Agriculture, Forestry, Fishing and Hunting	241866864

Execution Explained:

Step 1: We first sort by date, industry and -prcvol (-prcvol means sort by prcvol descendingly). By doing that, when we group by date and industry in Step 2, prcvol[1:5] will correspond to the largest 5 values of prcvol.
- Don’t leave out the minus sign in -prcvol. Otherwise, prcvol will be sort ascendingly and hence prcvol[1:5] will correspond to the smallest 5 values of prcvol.
Step 2: Calculate the total trading volume.

3.21. ⭐️ In each industry, what’s the daily average return of stocks whose dollar trading volume exceeds the 80th percentile of the industry?#

data[, 
    # Step 1: only keep rows whose trading volume is in the top 20%
    .SD[prcvol>=quantile(prcvol, 0.8)],
    keyby=.(date, industry)
    # compute the average return
    ][, .(avg_ret=weighted.mean(retx, cap, na.rm=T)),
    keyby=.(date, industry)
    ][1:3]

A data.table: 3 × 3
date	industry	avg_ret
<date>	<chr>	<dbl>
2023-01-03	Accommodation and Food Services	0.003159201
2023-01-03	Administrative and Support and Waste Management and Remediation Services	0.001522516
2023-01-03	Agriculture, Forestry, Fishing and Hunting	0.003461414

Execution Explained:

In Step 1, we first group by date and industry. Next, within each group, we only want to keep rows whose dollar trading volume is larger than the 80th percentile, i.e., lines that satisfy prcvol>=quantile(prcvol, 0.8). The results of prcvol>=quantile(prcvol, 0.8) is a boolean vector with the same length as prcvol, where TRUE means the condition is met while FALSE means not. We use this boolean vector to select the data: .SD[prcvol>=quantile(prcvol, 0.8)]. Remember that .SD (Subset of Data.table) is a special symbol in data.table, and it represents the current group. Basically, what .SD[prcvol>=quantile(prcvol, 0.8)] does is to only keep rows in .SD that satisfy the condition.
In Step 1, we compute the market-capital-weighted average return.

Extension:

If you just want to compute the average return instead of the value-weighted return, the following code is a simpler implementation (note how we use the results of prcvol>=quantile(prcvol, 0.8)] to select retx):

data[, 
    .(avg_ret = mean(retx[prcvol>=quantile(prcvol, 0.8)])),
    keyby=.(date, industry)
    ][1:3]

A data.table: 3 × 3
date	industry	avg_ret
<date>	<chr>	<dbl>
2023-01-03	Accommodation and Food Services	0.001960478
2023-01-03	Administrative and Support and Waste Management and Remediation Services	0.001708134
2023-01-03	Agriculture, Forestry, Fishing and Hunting	0.010825990

3.22. What’s the correlation coefficient between the average return of the top 10% stocks with the largest dollar trading volume and the bottom 10% stocks with the smallest dollar trading volume?#

data_tmp = data[, 
    # Step 1: select the returns of the top/bottom 10% stocks by trading volume
    .(ret_top10=retx[prcvol>=quantile(prcvol, 0.9)],
      cap_top10=cap[prcvol>=quantile(prcvol, 0.9)],
      ret_bottom10=retx[prcvol<=quantile(prcvol, 0.1)],
      cap_bottom10=cap[prcvol<=quantile(prcvol, 0.1)]), 
    keyby=.(date)
    ][, 
    # Step 2: compute the weighted average return on each day
    .(ret_top10=weighted.mean(ret_top10, cap_top10, na.rm=T),
      ret_bottom10=weighted.mean(ret_bottom10, cap_bottom10, na.rm=T)), 
    keyby=.(date)]
    
data_tmp[1:3]  # output of Step 2

A data.table: 3 × 3
date	ret_top10	ret_bottom10
<date>	<dbl>	<dbl>
2023-01-03	-0.004592822	0.0007446199
2023-01-04	0.007174620	0.0066312533
2023-01-05	-0.011002616	-0.0018086251

data_tmp[, 
    # Step 3: compute the correlation coefficient
    cor(ret_top10, ret_bottom10)]

0.620066264125022

Execution Explain:

Step 1: We select the top/bottom 10% stocks by trading volume. Note that we use the same technique used in the previous problem: Frist create a boolean vector with prcvol>=quantile(prcvol, 0.9) and then use it to select the returns with retx[prcvol>=quantile(prcvol, 0.9)]. We also select the market capital cap for the top/bottom 10% stocks because we’ll use them as the weight in Step 2.
Step 2: Compute the value-weighted return for each day.
Step 3: Compute the correlation coefficient.

3.23. ⭐️ Find out the industries whose component stocks have a larger average dollar trading volume than the market average#

To make sure we’re on the same page, my understanding of this problem is that it requires us to:

Calculate the average trading volume of stocks on the market on each day (i.e., we aggregate the trading volume of all stocks on the market and then divide by the number of stocks). This is the “market average.”
Similarly, calculate the average trading volume for stocks within each industry. So every industry will have its own value. This is the “industry average.”
Select all industries whose “industry average” is larger than the “market average.”

data[, 
    # Step 1: compute the "market average"
    .(prcvol_mkt=weighted.mean(prcvol, cap, na.rm=T), industry, prcvol, cap), 
    keyby=.(date)
    # Step 2: compute the "industry average"
    ][, .(prcvol_ind=weighted.mean(prcvol, cap, na.rm=T), prcvol_mkt=prcvol_mkt[1]), 
    keyby=.(date, industry)
    # Step 3: only select industries whose average trading volume is larger than the market average
    ][prcvol_ind>prcvol_mkt
    ][1:5]

A data.table: 5 × 4
date	industry	prcvol_ind	prcvol_mkt
<date>	<chr>	<dbl>	<dbl>
2023-01-03	Information	2849298873	1998086920
2023-01-03	Manufacturing	2940308173	1998086920
2023-01-03	Retail Trade	2219859287	1998086920
2023-01-04	Information	4682660169	2103339254
2023-01-04	Manufacturing	2634129565	2103339254

Output Explained:

Take “2023-01-03” for example. This date has three rows (1-3), each representing an industry that meet our condition. prcvol_ind is the industry average dollar volume and prcvol_mkt is the market average dollar volume.

Execution Explained:

Step 1: We first group by date, because we want to include all stocks on each day to compute the market average. .(prcvol_mkt=weighted.mean(prcvol, cap, na.rm=T), industry, prcvol, cap) does two things:
- It includes industry, prcvol and cap in the output. These three columns are presented in the original data. We include them because we’ll need them in Step 2.
- It also creates a new column procvol_mkt, which is the weighted average dollar trading volume of the market.
Step 2: We next group by date and industry, because we want to calculate the industry average. The tricky thing is prcvol_mkt=prcvol_mkt[1], why we need to explicitly select the first element ([1])? Let’s look at the full line of code:
- .(prcvol_ind=weighted.mean(prcvol, cap, na.rm=T), prcvol_mkt=prcvol_mkt[1]). This line created two columns: prcvol_ind and prcvol_mkt. For prcvol_ind, the result is a scalar (a single number). However, prcvol_mkt is a constant vector, meaning its elements are identical. This is understandable, for each industry, the “market average” should be the same. The problem is, if you put a scaler and a constant together in the output, the result will be “broadcasted,” meaning the scaler will be repeated until it has the same length as the vector. This is not desirable, and that’s why we also need to reduce prcvol_mkt to a scaler by prcvol_mkt=prcvol_mkt[1].

Avoid Unnecessary Broadcasting

When using data.table, you need to pay special attention to the size of the output. Techniques like prcvol_mkt=prcvol_mkt[1] are very useful to avoid unnecessary broadcasting.

If we don’t reduce prcvol_mkt to a scaler, the code and output will become like this (note that many duplicates are introduced):

data[, 
    # Step 1: compute the "market average"
    .(prcvol_mkt=weighted.mean(prcvol, cap, na.rm=T), industry, prcvol, cap), 
    keyby=.(date)
    # Step 2: compute the "industry average"
    ][, .(prcvol_ind=weighted.mean(prcvol, cap, na.rm=T), prcvol_mkt=prcvol_mkt), 
    keyby=.(date, industry)
    ][1:5]

A data.table: 5 × 4
date	industry	prcvol_ind	prcvol_mkt
<date>	<chr>	<dbl>	<dbl>
2023-01-03	Accommodation and Food Services	407673816	1998086920
2023-01-03	Accommodation and Food Services	407673816	1998086920
2023-01-03	Accommodation and Food Services	407673816	1998086920
2023-01-03	Accommodation and Food Services	407673816	1998086920
2023-01-03	Accommodation and Food Services	407673816	1998086920

3.24. ⭐️ What’s the daily “abnormal return” of each stock?#

In finance, the daily stock return is conceptually viewed as the sum of two parts: a “baseline return,” which is the part that can be explained by some known factors (e.g., firm characteristics), and an “abnormal return” that cannot. As you can see, abnormal return is defined against how the baseline return is computed. In this example, we use the “market model” to calculate the baseline return, and the abnormal return is calculated as the raw stock return less the baseline return:

Assume $r_{i t}$ is the raw daily return for stock $i$ on day $t$ ; $r_{t}^{m k t}$ is the market return on day $t$ . We first run the following regression:

r_{i t} = α_{i} + β_{i} r_{t}^{m k t} + ε_{i t}

The $α_{i} + β_{i} r_{t}^{m k t}$ part is called the baseline return, because it’s something that can be explained by the coefficients of the regression. Abnormal return $A R_{i t}$ is hence defined as:

A R_{i t} = r_{i t} - (α_{i} + β_{i} r_{t}^{m k t})

which is exactly the $ε$ of the regression.

data[,
    # Step 1: construct the market return
    ':='(ret_mkt=weighted.mean(retx, cap, na.rm=T)),
    keyby=.(date)
    ][, 
    # Step 2: only keep stocks with at least 50 observations
    .SD[.N>=50], keyby=.(permno)
    ][
    # (optional) remove missing values
    !is.na(retx)
    ][,
    # Step 3: get alpha, beta, and AR
    {
        lm_out = lm(retx ~ ret_mkt)
        alpha = coef(lm_out)[1]
        beta = coef(lm_out)[2]
        ar = resid(lm_out)
        list(date, ar, alpha, beta)
    },
    keyby=.(permno)
    ][1:3]

A data.table: 3 × 5
permno	date	ar	alpha	beta
<dbl>	<date>	<dbl>	<dbl>	<dbl>
10026	2023-01-03	0.012975042	7.201256e-05	0.2987578
10026	2023-01-04	-0.004386641	7.201256e-05	0.2987578
10026	2023-01-05	-0.007474872	7.201256e-05	0.2987578

Execution Explained:

Step 1: We construct a new column ret_mkt to capture the daily value-weighted return of the whole market.
Step 2: We remove any stock that has less than 50 observations (days) in the sample. (If there’re too few observations, the regression results in Step 3 will not be robust.)
- We first group by permno, because the operation is for individual stocks.
- You should be familiar with .SD now, which represents the current group, i.e., the whole data for the current permno.
- .N>=50: .N is the number of rows in the current group, we require it to be no less than 50.
Step 3: The whole operations are wrapped in curly brackets, because we want to reuse what we’ve created in the j part (we also used this technique in this problem).
- We first group by permno, because the regression is for individual stocks.
- We then regress the daily return on the market return, lm(retx ~ ret_mkt), and save the result to lm_out.
- We use coef(lm_out) to extract the coefficients of the regression results. The output is a list, with the first element being the intercept and the second element being the beta.
- We also use resid(lm_out) to extract the residuals. That’s exactly the abnormal return we’re looking for.
- At last, we specify what variables we want to put in the final output: list(date, ar, alpha, beta)

3.25. ⭐️ What’s the daily “self-excluding” industry return?#

Let me explain the requirements of the problem. Suppose on day t, there’re three industries: A, B, and C, and their daily returns are “ret_A,” “ret_B” and “ret_A.” The problem asks us to calculate three versions of daily industry return: 1) the average of ret_A and ret_B, 2) the average of ret_B and ret_C, and 3) the average of ret_A and ret_C.

Specifically, for industry $i$ on day $t$ , its daily self-excluding return $r e t_{i t}$ is defined as:

r e t_{i t} = \sum_{j = 1, j \neq i}^{N} {\frac{i n d_c a p_{j t} \times i n d_r e t_{j t}}{\sum_{j = 1, j \neq i}^{N_{j}} i n d_c a p_{j}}}

where

$N$ : total number of industries
$N_{j}$ : total number of stocks in industry $j$ .
$i n d_c a p_{j t}$ : daily market capital of industry $j$ .
$i n d_r e t_{j t}$ : daily return of industry $j$ .

We’re going to approach this problem as follows:

First, compute the daily average return for each industry.
- We define industry average return as the value-weighed return of its component stocks. We’ve done this in this problem.
Second, calculate the self-excluding return.

# First, compute the daily average return for each industry
# For the definition of `ind_ret` and `ind_cap`, see the above explanation
ind_rets = data[, 
    .(ind_ret=weighted.mean(retx, cap, na.rm=T),
      ind_cap=sum(cap, na.rm=T)),
    keyby=.(date, industry)]

ind_rets[1:3]

A data.table: 3 × 4
date	industry	ind_ret	ind_cap
<date>	<chr>	<dbl>	<dbl>
2023-01-03	Accommodation and Food Services	0.002703254	722345679
2023-01-03	Administrative and Support and Waste Management and Remediation Services	0.001455542	1470427710
2023-01-03	Agriculture, Forestry, Fishing and Hunting	0.002476840	48613819

# Then, calculate the self-excluding return
ind_rets[,
    {
        # Step 1: create an empty vector to store the results
        self_ex_ret = vector()

        # Step 2: for each industry, compute the self-excluding return
        for (i in 1:.N) {
            self_ex_ret[i] = weighted.mean(ind_ret[-i], ind_cap[-i], na.rm=T)
        }

        # Step 3: return the results
        list(industry, self_ex_ret)
    },
    keyby=.(date)
    ][1:3]

A data.table: 3 × 3
date	industry	self_ex_ret
<date>	<chr>	<dbl>
2023-01-03	Accommodation and Food Services	-0.004230309
2023-01-03	Administrative and Support and Waste Management and Remediation Services	-0.004322716
2023-01-03	Agriculture, Forestry, Fishing and Hunting	-0.004106102

Execution Explained:

Note that we wrap the whole j part in curly brackets again.

Step 1: We initialize the self-excluding return self_ex_ret as an empty vector (this helps to speed up the loop execution in Step 2).
Step 2: We loop over each industry (from 1 to .N). Note that we’ve already grouped by date, so .N is exactly the number of industries in each group.
- ind_ret[-i] represents the return vector ind_ret with the i-th element removed.
- ind_cap[-i] represents the market capital vector ind_cap with the i-th element removed.

3.26. ⭐️ How many industries exceed the market average return on each day?#

data[,
    # Step 1: compute the market return
    ':='(ret_mkt=weighted.mean(retx, cap, na.rm=T)),
    keyby=.(date)
    ][,
    # Step 2: compute the industry return
    .(ind_ret=weighted.mean(retx, cap, na.rm=T), ret_mkt=ret_mkt[1]),
    keyby=.(date, industry)
    ][
    # Step 3: calculate the number of stocks exceeding the market average
    ind_ret>ret_mkt,
    .(num_stk=.N, industry=list(industry)),
    keyby=.(date)
    ][1:2]

A data.table: 2 × 3
date	num_stk	industry
<date>	<int>	<list>
2023-01-03	14	Accommodation and Food Services , Administrative and Support and Waste Management and Remediation Services, Agriculture, Forestry, Fishing and Hunting , Arts, Entertainment, and Recreation , Construction , Educational Services , Finance and Insurance , Health Care and Social Assistance , Information , Other Services (except Public Administration) , Real Estate and Rental and Leasing , Retail Trade , Transportation and Warehousing , Wholesale Trade
2023-01-04	15	Accommodation and Food Services , Administrative and Support and Waste Management and Remediation Services, Agriculture, Forestry, Fishing and Hunting , Arts, Entertainment, and Recreation , Construction , Educational Services , Health Care and Social Assistance , Management of Companies and Enterprises , Manufacturing , Mining, Quarrying, and Oil and Gas Extraction , Other Services (except Public Administration) , Professional, Scientific, and Technical Services , Real Estate and Rental and Leasing , Transportation and Warehousing , Utilities

Execution Explained:

Step 1 and Step 2: Calculate the industry- and market-level return. We’ve done this many times in previous problems.
Step 3: Calculate the number of stocks exceeding the market average
- ind_ret>ret_mkt (the i part) only selects industries that exceed the market average.
- num_ind=.N counts the number of industry.
- industry=list(industry). This part is interesting. It basically “packs” the multiple elements in a vector into a single element of a list, avoiding duplicated lines in the output. To understand that, recall that industry is a vector that contains all the industries that meet our criteria in the given day. To data.table, each element of industry will be expanded to a separate row. However, we only want to keep one row for each day in the output. By wrapping industry using list, we collect all the industries into a single cell.

Tip

var = list(var) or var = unique(list(var)) is a frequently used technique to pack the otherwise-occupying-many-lines output into a single row. This is especially useful when you want your output to be a set.

3.27. ⭐️ How many stocks have experienced at least one daily price increase of 5%+ in the past three days? (Count the number on each day.)#

To be clear, we define what’s “past three days” as follows: on day t, the previous three days are [t-3, t-2, t-1]. Of course you can have different definition as mine, but the solution is the same.

The challenge in Q9 is that we need to do rolling count. On day t, we need to query the values from (t-3):(t-1).

Warning

Q9 is difficult and it requires some advanced techniques of data.table. However, as you can see below, the core of the code is only about five lines. I can image that the code could easily exceed dozens of lines if you use dplyr or pandas.

I hope that after reading Q9, you’ll get an appreciation of how amazing and efficient the syntax of data.table is.

data[
    # Step 1: convert the permno into a list
    retx>=0.05, .(permno_list=list(permno)), keyby=.(date)
    ][, {
    # Step 2: compute the number of stocks that open up 5% or more
    num_stk_3d = vector()
    for (i in 4:.N) {
        num_stk_3d[i] = permno_list[(i-3):(i-1)] %>% 
            unlist() %>% unique() %>% length()
    }
    list(date, num_stk_3d)
    }
    ][!is.na(num_stk_3d)
    ][1:3]

A data.table: 3 × 2
date	num_stk_3d
<date>	<int>
2023-01-06	738
2023-01-09	867
2023-01-10	747

Execution Explained:

Step 1: We first select stocks with daily return larger than 5% (retx>=0.05), then we create a new column permno_list which represents all permno in a given day as a list (see the output below).
- Converting to a list is important because it allows us to collect all the stocks across different days and run deduplication in Step 2.
Step 2: We compute the number of stocks that open up 5% or more in the previous three days.
- num_stk_3d = vector(): We first create an empty vector, it will store all the permnos (stock IDs) that open up 5% or more.
- for (i in 4:.N)...: This is the key part of the code. Let’s take i to be 4 and see what it does.
  - permno_list[(i-3):(i-1)] becomes permno_list[1:3], that is, the stock IDs in the first three days. Note that permno_list[1:3] is a list of three elements, where each element represents a day.
  - %>% unlist(): The unlist function “unpacks (flatterns)” permno_list[1:3] so the result becomes a non-nested list of stock IDs.
  - %>% unique(): Remove any duplicated stock IDs.
  - %>% length(): Count the number of stocks.
  - As you can see, when i equals 4, we count from day 1 to 3. Similarly when i equals 5, we count from day 2 to 4.
- list(date, num_stk_3d): Collect the final output columns and pack them into a list. This is required by data.table if the j part is wrapped by curly brackets {...} (if you don’t know what’s the j part please see the introduction of data.table). You can think of this line as a return command but with the return keyword hidden.

3.28. ⭐️ How many stocks outperform the market average return on all of the past three days?#

This problem is very much like the one we’ve just solved. The difference is that in the previous problem, stocks that meet the criteria in any of the past three days will be included, but this time we require that the stocks have to meet the criteria on all* of the past three days.

Try to do it yourself before continuing.

data[
    # Step 1: Compute the market average return
    , ':='(ret_mkt=weighted.mean(retx, cap, na.rm=T)), keyby=.(date)
    ][
    # Step 2: Convert the stocks meeting the criteria into a list
    retx>=ret_mkt, .(permno_list=list(permno)), keyby=.(date)
    ][, {
    # Step 3: compute the number of stocks that meet the criteria
    num_stk_3d = vector()
    for (i in 4:.N) {
        num_stk_3d[i] = Reduce(
            intersect, 
            permno_list[(i-3):(i-1)]
            ) %>% length()
    }
    list(date, num_stk_3d)
    }
    ][!is.na(num_stk_3d)
    ][1:3]

A data.table: 3 × 2
date	num_stk_3d
<date>	<int>
2023-01-06	676
2023-01-09	519
2023-01-10	456

3.29. ⭐️ How many stocks outperform the market at least for two out of the past three days?#

Now let’s make the problem harder!

data[
    # Step 1: Compute the market average return
    , ':='(ret_mkt=weighted.mean(retx, cap, na.rm=T)), keyby=.(date)
    ][
    # Step 2: Convert the stocks meeting the criteria into a list
    retx>=ret_mkt, .(permno_list=list(permno)), keyby=.(date)
    ][, {
    # Step 3: compute the number of stocks that meet the criteria
    num_stk_3d = vector()
    for (i in 4:.N) {
        num_stk_3d[i] = permno_list[(i-3):(i-1)] %>% unlist() %>% 
            table() %>% .[.>=2] %>% 
            length()
    }
    list(date, num_stk_3d)
    }
    ][!is.na(num_stk_3d)
    ][1:3]

A data.table: 3 × 2
date	num_stk_3d
<date>	<int>
2023-01-06	2268
2023-01-09	1951
2023-01-10	1922

As you can see, the key change is in Step 3, where we use

permno_list[(i-3):(i-1)] %>% unlist() %>% 
            table() %>% .[.>=2]

The above lines of code find out stocks that at least appear two times in the previous three days. To better understand how it works, let’s look at a simpler example:

# suppose we have a list with three elements, each elements is a vector containing letters
l = list(c('a', 'b', 'c'), c('a', 'b', 'f'), c('a', 'h', 'i'))

# Now we want to find out the letters that appear at least two times in the list
l %>% unlist() %>% table() %>% .[.>=2] %>% names()
#> [1] "a" "b"

# To understand the above line, let's break it down:

l %>% unlist()  # unpack
#> [1] "a" "b" "c" "a" "b" "f" "a" "h" "i"

l %>% unlist() %>% table()  # count frequency of each letter
#> .
#> a b c f h i 
#> 3 2 1 1 1 1

l %>% unlist() %>% table() %>% .[.>=2]  # only keep frequencies >= 2
#> .
#> a b
#> 3 2

l %>% unlist() %>% table() %>% .[.>=2] %>% names()
#> [1] "a" "b"

If you understand the above example, you won’t have any problem in understanding the answer key.

3.30. In each month, how many stocks outperform the market?#

Let’s make sure our understanding of this problem is on the same page: This problem requires us to compare the monthly return of the stocks and the market, not daily.

In the following solution, I put the calculation of monthly “stock” and “market” returns in two separate cells. It’s because the computation requires us to aggregate the data on different levels: to compute market return, we only aggregate on year-month, but for stock return, we need to aggregate on year-month as well as on the individual stock level. So it’s better to process them separately.

#--- Calculate the monthly "market" return ---#

mon_ret_mkt = data[
    # Step 0: sort by permno and date
    order(permno, date), .(permno, date, retx, cap)
    ][,
    # Step 1: create the daily market return
    .(ret_mkt=weighted.mean(retx, cap, na.rm=T),
      yearmon=floor_date(date, 'month')), 
    keyby=.(date)
    ][,
    # Step 2: create the monthly market return
    .(mon_ret_mkt=prod(1+ret_mkt)-1),
    keyby=.(yearmon)
    ]

mon_ret_mkt[1:3]

A data.table: 3 × 2
yearmon	mon_ret_mkt
<date>	<dbl>
2023-01-01	0.07608082
2023-02-01	-0.01650494
2023-03-01	0.03510999

Execution Explained:

Most of the above code shouldn’t be unfamiliar to you by now. There’re two places worth noting:

In Step 1, we create a month indicator by using the floor_date function from the lubridate package. floor_date(date, 'month') will floor every date to the beginning of the month (note here we specify the rounding unit to “month”), e.g., “2023-05-23” will be floored to “2023-05-01.” This quickly creates a month indicator.
In Step 2, we use prod(1+ret_mkt)-1 to compute the monthly market return. prod is essentially $\prod$ , which multiply all elements in the given vector. Because ret_mkt is the daily market return, the result of prod(1+ret_mkt) is the net value at the end of the month. We take 1 from the result, yielding the market return over the past month.

#--- Calculate the monthly "stock" return and count the required stocks ---#

data[
    # Step 0: sort by permno and date
    order(permno, date), .(permno, date, retx, cap)
    ][,
    # Step 1: create a month indicator
    ':='(yearmon=floor_date(date, 'month'))
    ][,
    # Step 2: calculate the monthly return for stocks
    .(mon_ret_stk=prod(1+retx)-1),
    keyby=.(yearmon, permno)
    ][
    # Step 3: merge the market return to the stock return
    mon_ret_mkt, on=.(yearmon), nomatch=NULL
    ][,
    # Step 4: count the stocks that beat the market
    .(num_stk=sum(mon_ret_stk>mon_ret_mkt, na.rm=T)),
    keyby=.(yearmon)
    ][1:3]

A data.table: 3 × 2
yearmon	num_stk
<date>	<int>
2023-01-01	2140
2023-02-01	1897
2023-03-01	843

Execution Explained:

Step 1 to 2 calculate the monthly stock return. It’s very similar to how we calculate the monthly market return previously.
Step 3: We merge the market return (which we just computed) to the stock return. The key used to join is yearmon.
Step 4: We count the stocks that beat the market.

Tip

There’re other methods to create a month indicator. For example, we can create a string by extracting the year and month part of a date, something like

x = as.Date('2023-03-12')

strftime(x, '%Y-%m')
#> [1] "2023-03"

However, I personally didn’t recommend this method for the following reason:

Representing a date using a string is slow, because CPU is much faster processing numbers (the Date object is represented as a number in R) than strings.
It creates many troubles when you need to visualize your data (i.e., plotting). Visualization packages treat strings as discrete, unrelated observations, so “2023-01-01,” “2023-02-01,” “2023-06-01” will all be evenly spaced on the axis, while actually their gaps are one month and four month, respectively.

3.31. Find out the three stocks that’re mostly correlated with the market#

We first compute, for each stock, the correlation coefficient between the stock return and the market return. Then we select the top 10.

data[,
    # Step 1: compute the market return
    ':='(ret_mkt=weighted.mean(retx, cap, na.rm=T)), keyby=.(date)
    ][,
    # Step 2: calculate the correlation coefficient
    .(cor_coef=cor(retx, ret_mkt), ticker=ticker[1]), keyby=.(permno)
    ][
    # Step 3: sort and select the top 10
    order(-cor_coef)][1:3]

A data.table: 3 × 3
permno	cor_coef	ticker
<dbl>	<dbl>	<chr>
15407	1	SPNE
15542	1	VTVT
24009	1	GENK

Execution Explained:

No new techniques are introduced in this problem. The only thing you need to pay attention is ticker=ticker[1] in Step 1. We only keep the first element of ticker to avoid outputting duplicated rows. Refer to this problem for details.

3.32. Find out the stock that has the largest monthly decrease in trading volume#

My understanding of the problem: at the end of every month, we compute the percentage change in monthly trading volume for each stock, than we select the one with the largest decrease.

data[, 
    # Step 1: calculate the montly trading volume
    # (also create the month indicator)
    .(mon_prcvol=sum(prcvol)),
    keyby=.(industry, yearmon=floor_date(date, 'month'))
    ][,
    # Step 2: calculate the percentage change in trading volume
    .(yearmon, mon_prcvol_change=mon_prcvol/shift(mon_prcvol)-1),
    keyby=.(industry), 
    ][, 
    # Step 3: remove NA
    na.omit(.SD)
    ][
    # Step 3: pick the largest decrease
    order(yearmon, -mon_prcvol_change),
    .SD[1],
    keyby=.(yearmon)
    ][1:3]

A data.table: 3 × 3
yearmon	industry	mon_prcvol_change
<date>	<chr>	<dbl>
2023-02-01	Agriculture, Forestry, Fishing and Hunting	0.3305681
2023-03-01	Finance and Insurance	0.7379635
2023-04-01	Arts, Entertainment, and Recreation	-0.1556650

Execution Explained:

Step 1: Calculate monthly trading volume by aggregating the daily trading volume in the month. Note how we create the monthly indicator yearmon in the by part! (Yes, you can create new columns in the by part on the fly!) Of course, you can first create yearmon and then group by it, but isn’t it cool that data.table can do two things in one shot?
Step 2: Calculate monthly change. Note how we use data.table::shift to create a lag term of mon_prcvol.
Step 3: Pick the largest decrease. Step 3 is executed in the following order:
- order(yearmon, -mon_prcvol): Sort by yearmon (ascendingly) and mon_prcvol (descendingly).
- keyby=.(yearmon): Then group by yearmon. Because we’ve already sorted the data, now the first row in each group has the largest decrease. Similar technique is also used in this problem.
- .SD[1]: Select the first row in each group.

3.33. ⭐️ What’s the maximum drawdown of each stock?#

“A drawdown is a peak-to-trough decline during a specific period for an investment, trading account, or fund. A drawdown measures the historical risk of different investments, compares fund performance, or monitors personal trading performance.”

– Invespedia

The maxinum drawdown of a stock is the largest decline of all the drawdowns in during the period. We’ll use the table.Drawdowns function from the PerformanceAnalytics package to compute the maximum drawdowns.

drawdowns <- function(ret, date, top=5) {
    # This function calculates the drawdowns for a given return series
    # Input:
    #   ret: a vector of returns
    #   date: a vector of dates
    #   top: the number of largest drawdowns to be returned

    input = data.table(date, ret)[order(date)]

    # a drawdown is defined whenever the cumulative return is below the previous peak
    drawdowns_raw = input[, {
        ret_cum=cumprod(1 + ret)
        ret_cummax=cummax(c(1, ret_cum))[-1]
        drawdowns=ret_cum/ret_cummax - 1
        list(date, ret, ret_cum, ret_cummax, drawdowns)
    }]

    # find the largest drawdowns
    drawdowns = drawdowns_raw[, ':='(grp = rleid(drawdowns<0))
        ][drawdowns<0, 
        .(start=date[1], trough=date[which.min(drawdowns)],
          end=date[.N], depth=drawdowns[which.min(drawdowns)]),
        keyby=.(grp)
        ][order(depth)
        ][1:top
        ][, ':='(grp=NULL)]
}

data[
    # Step 1: sort by permno, date
    order(permno, date)
    ][, 
    # Step 2: calculate drawdowns for each stock
    {drawdowns(retx, date, top=1)},
    keyby=.(permno)
    ][1:3]

A data.table: 3 × 5
permno	start	trough	end	depth
<dbl>	<date>	<date>	<date>	<dbl>
10026	2023-01-17	2023-02-01	2023-04-28	-0.1009774
10028	2023-03-13	2023-05-01	2023-06-30	-0.2337500
10032	2023-01-25	2023-05-04	2023-06-30	-0.2609265

data[, .(mean(retx), sd(retx))]

A data.table: 1 × 2
V1	V2
<dbl>	<dbl>
0.001109345	0.05221605

(1/8)/(1/15.1)-1

0.8875

suppressWarnings({
    data[
    # Step 1: sort by permno, date
    order(permno, date)
    ][, 
    # Step 2: calculate drawdowns for each stock
    {
        ret_zoo = zoo(retx, order.by=date)
        PerformanceAnalytics::table.Drawdowns(ret_zoo, top=1)[, c("From", "Trough", "Depth")]
    },
    keyby=.(permno)
    ][1:3]
})

A data.table: 3 × 4
permno	From	Trough	Depth
<dbl>	<date>	<date>	<dbl>
10026	2023-01-17	2023-02-01	-0.1010
10028	2023-03-13	2023-05-01	-0.2337
10032	2023-01-25	2023-05-04	-0.2609

Output Explained:

From: the start date of the maximum drawdown.
Trough: the date of the trough of the maximum drawdown.
Depth: the percentage decline, i.e., the magnitude of drawdown.

For more details about the From, Trough, Depth and perhaps other possible outputs, please refer to the documentation of the table.Drawdowns.

Execution Explained:

Step 1: Sort by permno and date.
Step 2: Calculate drawdowns for each stock.
- We group by permno, because the drawdown is calculated for each stock.
- table.Drawdowns requires the input to be a zoo or xts object, so we create a zoo object with ret_zoo = zoo(retx, order.by=date). The resulting ret_zoo object contains the daily stock return retx, and its row index represents the date date.
- PerformanceAnalytics::table.Drawdowns(): We call the table.Drawdowns() function. Note that we explicitly tell R that the functdion is from the PerformanceAnalytics package by adding the PerformanceAnalytics:: prefix. This is sometimes convenient especially when we only want to use one function without librarying the whole package.

3.34. What’s the win (success) ratio of each stock?#

The win/loss, or success ratio, is a trader’s number of winning trades divided by the number of losing trades. When it comes to the stock like in this problem, it’s the ratio of positive-return days during the period.

data[
    # Step 0: sort by permno, date
    order(permno, date),
    # Step 1: calculate the win ratio
    .(win_ratio=sum(retx>0)/.N),
    keyby=.(permno)
    ][1:3]

A data.table: 3 × 2
permno	win_ratio
<dbl>	<dbl>
10026	0.5322581
10028	0.5483871
10032	0.5080645

Execution Explained:

Step 1: Calculate the win ratio. sum(rex>0) gives us the number of days with positive returns. We then divide it by the total number of days (.N), which gives us the win ratio.

3.35. What’s the win ratio of the market?#

This problem is very similar to the previous one.

data[,
    # Step 1: Calculate the market daily return
    .(ret_mkt=weighted.mean(retx, cap, na.rm=T)),
    keyby=.(date)
    ][,
    # Step 2: Calculat the win ratio
    .(win_ratio=sum(ret_mkt>0)/.N)
    ]

A data.table: 1 × 1
win_ratio
<dbl>
0.5806452

Output Explained:

Hmm… It seems there’re more gaining days than losing days.

3.36. What’s the win ratio of each industry?#

Try it yourself before continuing to read!

data[,
    # Step 1: calculate the industry return
    .(ind_ret=weighted.mean(retx, cap, na.rm=T)),
    keyby=.(date, industry)
    ][,
    # Step 2: calculate the win ratio
    .(win_ratio=sum(ind_ret>0)/.N),
    keyby=.(industry)
    ][1:3]

A data.table: 3 × 2
industry	win_ratio
<chr>	<dbl>
Accommodation and Food Services	0.5806452
Administrative and Support and Waste Management and Remediation Services	0.5161290
Agriculture, Forestry, Fishing and Hunting	0.5403226

3.37. What’s the profit/loss ratio of each stock?#

In this problem, the profit/loss ratio is defined as the average return on winning days (i.e., days with positive returns) divided by the average loss on losing days (i.e., days with negative returns):

r a t i o = (\frac{1}{# w i n n i n g d a y s} \sum_{t \in w i n n i n g d a y s} r e t_{t}) / (\frac{1}{# l o s i n g d a y s} \sum_{t \in l o s i n g d a y s} r e t_{t})

Profit/loss ratio can also be used to evaluate the performance of a trader, see details here.

data[, 
    {
        # Step 1: compute the variables
        avg_win=mean(retx[retx>0])
        avg_loss=mean(retx[retx<0])
        ratio=abs(avg_win/avg_loss)

        # Step 2: collect the outputs
        list(avg_win=avg_win, avg_loss=avg_loss, ratio=ratio)
    },
    keyby=.(permno)
    ][1:3]

A data.table: 3 × 4
permno	avg_win	avg_loss	ratio
<dbl>	<dbl>	<dbl>	<dbl>
10026	0.009876289	-0.01007681	0.9801012
10028	0.020682302	-0.02011190	1.0283615
10032	0.012812238	-0.01370729	0.9347023

Execution Explained:

Step 1: Calculate the average win/loss and the profit/loss ratio. Note that we use retx[retx>0] to extract the positive elements in retx.
Step 2: We collect the outputs into a list.

3.38. What’s the profit/loss ratio of the market?#

Very similar to the previous one.

data[, 
    # Step 1: calculate the market return
    .(ret_mkt=weighted.mean(retx, cap, na.rm=T)), keyby=.(date)
    ][,
    # Step 2: calculate the win ratio
    {
        avg_win=mean(ret_mkt[ret_mkt>0])
        avg_loss=mean(ret_mkt[ret_mkt<0])
        ratio=abs(avg_win/avg_loss)
        list(avg_win, avg_loss, ratio)
    }
    ]

A data.table: 1 × 3
avg_win	avg_loss	ratio
<dbl>	<dbl>	<dbl>
0.007913403	-0.007194593	1.09991

Output Explained:

Hmm… At least in the sample period, the market is going up, as shown by the ratio being larger than one.

3.39. Is there a situation where the monthly trading volume of a stock exceeds the total daily trading volume of its industry in the same month?#

While in real world it’s rare that we’re asked to answer such a question, but this problem is a good test of your data wrangling skills anyway.

This problem is a mouthful, let’s make sure we’re on the same page: To solve this problem, we first compute the monthly trading volume of each stock. We also calculate the minimum daily trading volume of the industry that this stock belongs to in the same month. At last, we compare the stock’s monthly trading volume with the industry’s minimum daily trading volume.

stk_mon_prcvol = data[,
    # Step 1: calculate the stocks' monthly trading volume
    .(stk_mon_prcvol=sum(prcvol), industry=industry[1]),
    keyby=.(permno, yearmon=floor_date(date, 'month'))]


ind_mon_prcvol = data[,
    # Step 2: calculate the industry's daily trading volume
    .(ind_daily_prcvol=sum(prcvol)),
    keyby=.(industry, date)
    ][,
    # Step 3: find the mininum daily trading volume for each month
    .(ind_min_daily_prcvol=min(ind_daily_prcvol)),
    keyby=.(industry, yearmon=floor_date(date, 'month'))
    ]

stk_mon_prcvol[
    # Step 4: join the two tables
    ind_mon_prcvol, on=.(industry, yearmon), nomatch=NULL
    ][
    # Step 4: find out the stocks meeting our criteria
    stk_mon_prcvol>ind_min_daily_prcvol
    ][1:3]

A data.table: 3 × 5
permno	yearmon	stk_mon_prcvol	industry	ind_min_daily_prcvol
<dbl>	<date>	<dbl>	<chr>	<dbl>
14338	2023-01-01	4220384337	Accommodation and Food Services	4181358836
43449	2023-01-01	14546642335	Accommodation and Food Services	4181358836
77702	2023-01-01	13580162865	Accommodation and Food Services	4181358836

Execution Explained:

Step 1: calculate the stocks’ monthly trading volume. Note how we create a year-month indicator (yearmon) in the by part. The same technique is also used in, for example, this problem.
Step 2: calculate the industry’s daily trading volume.
Step 3: find the minimum daily trading volume of each industry in a month.
Step 4: join the two tables: “monthly stock trading volume” and “minimum industry daily trading volume in each month”.
Step 5: select the stocks that meet our criteria.

3.40. What’s the standard deviation of returns for stocks in each industry?#

Problem Explained:

This problems requires us to:

First, calculate the standard deviation of returns for each stocks.
Second, calculate the industry-level standard deviation, which is defined as the value-weighed average of the standard deviations of all its component stocks.

data[,
    # Step 1: calculate the return sd for each stock
    .(ret_sd=sd(retx), industry=industry[1], cap=mean(cap)),
    keyby=.(permno)
    ][,
    # Step 2: calculate the industry-level average sd.
    .(ret_sd=weighted.mean(ret_sd, cap, na.rm=T)),
    keyby=.(industry)
    ][1:3]

A data.table: 3 × 2
industry	ret_sd
<chr>	<dbl>
Accommodation and Food Services	0.01466278
Administrative and Support and Waste Management and Remediation Services	0.01664406
Agriculture, Forestry, Fishing and Hunting	0.01639743

Execution Explained:

Step 1: calculate the return sd for each stock. We keep industry because we’ll group by it in Step 2. We also calculate the average daily market capital cap, which will be used as the weight in Step 2.
Step 2: calculate the industry-level average sd.

3.41. ⭐️ For each stock, we can regress its daily returns on the market returns, which gives us one intercept term and one beta. What’s the distribution of the intercept term and the beta?#

Problem Explained:

We’ll use many techniques introduced previously to solve this problem.

Let $s t k_r e t_{i} t$ be the daily return for stock $i$ on day $t$ and $m k t_r e t_{i t}$ be the market daily return. This problem first requires us to run the following regression:

s t k_r e t_{i t} = α_{i} + β_{1 i} \cdot m k t_r e t_{i t} + ε_{i}

Then, the problems asks about the distribution of $α_{i}$ and $β_{1 i}$ . There’re many ways to describe the distribution of a variable. In this problem, we focus on the following three things:

What’s the mean and variance of $α_{i}$ and $β_{1 i}$ ?
Plot the density of $α_{i}$ and $β_{1 i}$ .

alpha_beta = data[,
    # Step 1: calculate the daily market return
    ':='(mkt_ret=weighted.mean(retx, cap, na.rm=T)),
    keyby=.(date)
    ][,
    # Step 2: get the \alpha and \beta
    {
        lm_out = lm(retx ~ mkt_ret)
        alpha = coef(lm_out)[1]
        beta = coef(lm_out)[2]
        list(alpha=alpha, beta=beta)
    },
    keyby=.(permno)
    ][
    # step 3: remove extreme values
    alpha %between% c(quantile(alpha, 0.02), quantile(alpha, 0.98))]

alpha_beta[1:3]

alpha_beta[, 
    # Step 4: calculate the statistics of alpha and beta
    .(alpha=mean(alpha, na.rm=T), beta=mean(beta, na.rm=T),
      sd_alpha=sd(alpha, na.rm=T), sd_beta=sd(beta, na.rm=T))
    ]

library(ggplot2)

# Step 5: plot the distribution of alpha and beta
alpha_beta %>% 
    ggplot(aes(x=alpha)) +
    geom_density()

alpha_beta %>% 
    ggplot(aes(x=beta)) +
    geom_density()

Show code cell output Hide code cell output

A data.table: 3 × 3
permno	alpha	beta
<dbl>	<dbl>	<dbl>
10026	7.201256e-05	0.2987578
10028	2.093678e-03	0.6188396
10032	-1.450841e-03	0.8415072

A data.table: 1 × 4
alpha	beta	sd_alpha	sd_beta
<dbl>	<dbl>	<dbl>	<dbl>
-0.0005043781	1.056732	0.003975075	0.857834

_images/1f2a4ffd285809da66d7c24af44d19658981e2a99b7aaa763c5be3b0059c4bf7.png

_images/7a371342d649a50b3e6a36ab7454ef73e54ccb547772807f37d71252bba77db5.png

Execution Explained:

Step 1: calculate the daily market return.
Step 2: calculate the alphas and betas. The technique is introduced in this problem.
Step 3: we remove extreme values beyond the 2% and 98% percentiles. We do this to make the following visualization more clear. In real world you may not want to this.
Step 4: calculate the statistics of alpha and beta.
Step 5: plot the density of alphas and betas using ggplot2. This is the first time we use visualization in this chapter. A quick start of ggplot2 can be found here.

3.42. ⭐️ Find out the stocks whose abnormal returns rank top 100 for all of the previous three days#

Problem Explained:

This problem is a combinition of problem “calculate the daily abnormal return” and prolbem “find out stocks that outperform the market in the past three days”. It requires us to:

First, compute the daily abnormal return for each stock.
Second, find out stocks whose abnormal return rank top 100 in each of the past three days.

Since you’ve already learned all the necessary techniques. Try it before continuing to read!

# Section 1: calculate daily abnormal return

daily_abnormal_return = data[,
    ':='(ret_mkt=weighted.mean(retx, cap, na.rm=T)), keyby=.(date)
    ][, .SD[.N>=50], keyby=.(permno)
    ][!is.na(retx)
    ][, {
        lm_out = lm(retx ~ ret_mkt)
        ar = resid(lm_out)
        list(date, ar)
    },
    keyby=.(permno)]

daily_abnormal_return[1:3]

A data.table: 3 × 3
permno	date	ar
<dbl>	<date>	<dbl>
10026	2023-01-03	0.012975042
10026	2023-01-04	-0.004386641
10026	2023-01-05	-0.007474872

Execution Explained:

Calculate the daily abnormal return for each stock. See this problem for details.

# Section 2: find out stocks whose abnormal return rank top 100 for all of the past three days

daily_abnormal_return[
    # Step 1: find out the stocks whose abnormal return ranks top 100 on each day
    order(date, -ar), .(permno_list=list(permno[1:100])), keyby=.(date)
    ][, 
    # Step 2: compute the number of stocks that meet the criteria
    {
    num_stk_3d = vector()  # store the number of stocks on each day
    stk_list_3d = list()  # store the list of stocks on each day

    for (i in 4:.N) {
        # `stks` is the list of stocks that meet the criteria on each day
        stks = Reduce(intersect, permno_list[(i-3):(i-1)])

        stk_list_3d[[i]] = stks  # don't use "[i]", you must use "[[i]]"
        num_stk_3d[i] = length(stks)
    }
    list(date, num_stk_3d, stk_list_3d)
    }
    ][!is.na(num_stk_3d)
    ][1:3]

A data.table: 3 × 3
date	num_stk_3d	stk_list_3d
<date>	<int>	<list>
2023-01-06	6	19969, 89411, 16178, 16303, 22797, 15059
2023-01-09	2	88626, 17906
2023-01-10	3	21089, 15779, 17906

Execution Explained:

The main part of the code is very similar to this problem. I only highlight some differences:

Step 1: We first sort by permno and date, then we group by permno. As a result, the first 100 rows in each group will have the largest abnormal return (list(permno[1:100])).
Step 2: In this problem, I not only want to count the number of stocks that meet our criteria, but also want to return the permno of these stocks. To this end, I created two variables, num_stk_3d and stk_list_3d, to store them separately.

3.43. ⭐️ Final Exam: Find out the stocks whose abnormal returns rank top 30% in the industry in all of the previous three days#

Problem Explained:

This is the final exam, try to solve it as a test of your learning outcome!

This problem requires us to:

First, calculate the daily abnormal return of each stock.
Second, find out the stocks whose abnormal return rank top 30% in its industry in all of the previous three days.

# Section 1: calculate daily abnormal return

daily_abnormal_return = data[,
    ':='(ret_mkt=weighted.mean(retx, cap, na.rm=T)), keyby=.(date)
    ][, .SD[.N>=50], keyby=.(permno)
    ][!is.na(retx)
    ][, {
        lm_out = lm(retx ~ ret_mkt)
        ar = resid(lm_out)
        list(date, ar, industry=industry[1])
    },
    keyby=.(permno)]


# Section 2: find out stocks whose abnormal return rank top 100 for all of the past three days

daily_abnormal_return[
    # Step 1: find out the stocks whose abnormal return ranks top 30% in the industry on each day
    order(date, industry, -ar), .(permno_list=list(permno[1:(0.3*.N)])), keyby=.(date, industry)
    ][,
    # Step 2: (important!) unpack and repack the permno list by date
    .(permno_list=list(unlist(permno_list))), keyby=.(date)
    ][, 
    # Step 3: compute the number of stocks that meet the criteria
    {
    num_stk_3d = vector()  # store the number of stocks on each day
    stk_list_3d = list()  # store the list of stocks on each day

    for (i in 4:.N) {
        # `stks` is the list of stocks that meet the criteria on each day
        stks = Reduce(intersect, permno_list[(i-3):(i-1)])

        stk_list_3d[[i]] = stks  # don't use "[i]", you must use "[[i]]"
        num_stk_3d[i] = length(stks)
    }
    list(date, num_stk_3d, stk_list_3d)
    }
    ][!is.na(num_stk_3d)
    ][1:3]

Show code cell output Hide code cell output

A data.table: 3 × 3
date	num_stk_3d	stk_list_3d
<date>	<int>	<list>
2023-01-06	172	16981, 85726, 16019, 77902, 23297, 65752, 24459, 93423, 14045, 20848, 77300, 13168, 19813, 89167, 17875, 89002, 93434, 13860, 16802, 14317, 14716, 76279, 79909, 92089, 19920, 90164, 20338, 21286, 12448, 20115, 19859, 21598, 80960, 66157, 15580, 17369, 91413, 16880, 20178, 18086, 64995, 22618, 71563, 21277, 58246, 15064, 23473, 14574, 59408, 79851, 16964, 18043, 79323, 21361, 81593, 18802, 18838, 18783, 88284, 34666, 82281, 88372, 16049, 63773, 14490, 16975, 75320, 22092, 18134, 16697, 26403, 15087, 81696, 78664, 20447, 14889, 19969, 89411, 19559, 16178, 16303, 19621, 22797, 80759, 92594, 18089, 15059, 77202, 14543, 16773, 92486, 91416, 72304, 78908, 77437, 18664, 13410, 19993, 22312, 77496, 91471, 14185, 17906, 88626, 22240, 85072, 17977, 20207, 13328, 83651, 93004, 88332, 14442, 90101, 15177, 19561, 91359, 20057, 22351, 91615, 19176, 84348, 19476, 27909, 22293, 15371, 25419, 14459, 92648, 93345, 90558, 13947, 92570, 43553, 88319, 89952, 82800, 22897, 22899, 81774, 76711, 21866, 18203, 17825, 19994, 21538, 13537, 15642, 22772, 23267, 18677, 89941, 78988, 21120, 91547, 18915, 20637, 20823, 89908, 67467, 12880, 14182, 20682, 18051, 15506, 16008, 19568, 82702, 20578, 31668, 78690, 13577
2023-01-09	111	16981, 85726, 77902, 20545, 22255, 15273, 18514, 21608, 93434, 13860, 20338, 19002, 20115, 14229, 19920, 76136, 85860, 20464, 23473, 14574, 10933, 34746, 17357, 23477, 82775, 17880, 52936, 58246, 91413, 18043, 91176, 14127, 34666, 18415, 81761, 92528, 88372, 16049, 22092, 16697, 65875, 13757, 14811, 88626, 16924, 14937, 16773, 17906, 23269, 14823, 89411, 19621, 19559, 88332, 11654, 15177, 20288, 20207, 83462, 17358, 21750, 90520, 52396, 78908, 13936, 14536, 19993, 90558, 86778, 66384, 17139, 88597, 19561, 79586, 34817, 21072, 17372, 80759, 11275, 91606, 14543, 17281, 20581, 41188, 22897, 16689, 89952, 81774, 82800, 23273, 13537, 19313, 17885, 20292, 21538, 15057, 23267, 21866, 91547, 19555, 77515, 19494, 67467, 20682, 20823, 16008, 23056, 78690, 17105, 19142, 16084
2023-01-10	100	85726, 16981, 77902, 20545, 75034, 21608, 19880, 18058, 22255, 15273, 21685, 17361, 91498, 20452, 90454, 14574, 20115, 76136, 69649, 91413, 89443, 16548, 17151, 82486, 19002, 14229, 23477, 20451, 23014, 20464, 15072, 16697, 87356, 16049, 34948, 23505, 22280, 17358, 22244, 21750, 15779, 17906, 88626, 83244, 23269, 88417, 15177, 21875, 19621, 16145, 19993, 23381, 14543, 83462, 20207, 78915, 84032, 33849, 87426, 85502, 86047, 85424, 28564, 51086, 16773, 21072, 20670, 16276, 19081, 83149, 11275, 76081, 28629, 11154, 91283, 22897, 82800, 47723, 87321, 21089, 16532, 21538, 16108, 93083, 92557, 23267, 91547, 19555, 19433, 67467, 19494, 20682, 15860, 90175, 50606, 91611, 23056, 66683, 19022, 84275

40 Practices on Stock Data Processing

Contents

3. 40 Practices on Stock Data Processing#

3.1. Load Packages#

3.2. Inspect the Data#

3.3. Find out all stocks whose ticker contains "COM"#

3.4. How many stocks on each day have positive/negative returns?#

3.5. How many stocks on each day have positive/negative returns (only select stocks with market capital larger than $100M)?#

3.6. How many stocks are there in each industry on each day?#

3.7. Rank industries by their average daily number of stocks#

3.8. On each day, is the industry with the most stocks also generates the largest trading volume?#

3.9. How many stocks on each day have a gain of more than 5% and a loss more than 5%?#

3.10. What are the daily total trading volume of the top 10 stocks with the largest daily increase and the top 10 stocks with the largest daily decrease?#

3.11. How many stocks increase by 5% at the time of open and close?#

3.12. What’s the correlation coefficient between trading volume and return?#

3.13. What’s the correlation between trading volume and the magnitude of return?#

3.14. What’s the correlation between industry-level trading volume and industry-level return?#

3.15. What’s the industry distribution of S&P 500 stocks?#

3.16. What’s the return volatility of each industry?#

3.17. ⭐️ What’s the correlation matrix of different industries?#

3.18. ⭐️ What’s the largest/smallest stocks in S&P 500 on each day?#

3.19. What is the ratio of the maximum trading volume to the minimum trading volume in each industry on each day?#

3.20. What’s the sum of the dollar trading volume for the top 5 stocks in each industry on each day?#

3.21. ⭐️ In each industry, what’s the daily average return of stocks whose dollar trading volume exceeds the 80th percentile of the industry?#

3.22. What’s the correlation coefficient between the average return of the top 10% stocks with the largest dollar trading volume and the bottom 10% stocks with the smallest dollar trading volume?#

3.23. ⭐️ Find out the industries whose component stocks have a larger average dollar trading volume than the market average#

3.24. ⭐️ What’s the daily “abnormal return” of each stock?#

3.25. ⭐️ What’s the daily “self-excluding” industry return?#

3.26. ⭐️ How many industries exceed the market average return on each day?#

3.27. ⭐️ How many stocks have experienced at least one daily price increase of 5%+ in the past three days? (Count the number on each day.)#

3.28. ⭐️ How many stocks outperform the market average return on all of the past three days?#

3.29. ⭐️ How many stocks outperform the market at least for two out of the past three days?#

3.30. In each month, how many stocks outperform the market?#

3.31. Find out the three stocks that’re mostly correlated with the market#

3.32. Find out the stock that has the largest monthly decrease in trading volume#

3.33. ⭐️ What’s the maximum drawdown of each stock?#

3.34. What’s the win (success) ratio of each stock?#

3.35. What’s the win ratio of the market?#

3.36. What’s the win ratio of each industry?#

3.37. What’s the profit/loss ratio of each stock?#

3.38. What’s the profit/loss ratio of the market?#

3.39. Is there a situation where the monthly trading volume of a stock exceeds the total daily trading volume of its industry in the same month?#

3.40. What’s the standard deviation of returns for stocks in each industry?#

3.41. ⭐️ For each stock, we can regress its daily returns on the market returns, which gives us one intercept term and one beta. What’s the distribution of the intercept term and the beta?#

3.42. ⭐️ Find out the stocks whose abnormal returns rank top 100 for all of the previous three days#

3.43. ⭐️ Final Exam: Find out the stocks whose abnormal returns rank top 30% in the industry in all of the previous three days#

3.3. Find out all stocks whose ticker contains `"COM"`#