AttributeError: 'NoneType' object has no attribute 'merge' #1308

@tiansuo-114

❓ Questions and Help

We sincerely suggest that you read the documentation carefully. If you are still puzzled after that, please describe the question clearly in this issue.
I ran the workflow twice and hit the same problem both times.
The crash occurred while running the Qlib factor mining scenario (Loop 10). The workflow failed during the Evaluation phase, inside `CoSTEERMultiEvaluator`.

It appears that when multiple tasks are running (e.g., parallel factor generation) and one task fails to return a valid feedback object (resulting in `None`), the function does not handle it gracefully and calls `merge` on that `None`, causing the `AttributeError` in the title.

I am using DeepSeek and 轨迹流动. The log leading up to the crash follows.
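To make the suspected failure mode concrete, here is a minimal sketch of a merge step that tolerates missing feedback. It does not use RD-Agent's real classes: `Feedback`, `merge_feedbacks`, and the `final_decision` / `messages` fields are hypothetical stand-ins, and treating a missing feedback as a failed task is only one possible policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feedback:
    """Hypothetical stand-in for a per-task feedback object (not RD-Agent's class)."""
    final_decision: bool
    messages: List[str]

    def merge(self, other: "Feedback") -> "Feedback":
        # The combined feedback passes only if both parts passed.
        return Feedback(
            final_decision=self.final_decision and other.final_decision,
            messages=self.messages + other.messages,
        )


def merge_feedbacks(feedbacks: List[Optional[Feedback]]) -> Optional[Feedback]:
    """Merge feedback objects; a None entry is recorded as a failed task."""
    merged: Optional[Feedback] = None
    for fb in feedbacks:
        if fb is None:
            # Assumption: a task that returned nothing counts as a failure
            # instead of crashing on None.merge(...).
            fb = Feedback(final_decision=False, messages=["no feedback returned"])
        merged = fb if merged is None else merged.merge(fb)
    return merged


result = merge_feedbacks([Feedback(True, ["a"]), None, Feedback(True, ["b"])])
print(result)
```

With a guard like this, one task returning `None` would mark the merged result as failed but would not abort the whole loop. Whether the evaluator should instead skip such tasks is a design decision for the maintainers.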

File Path: factor.py

import pandas as pd
import numpy as np
from typing import Optional

def calculate_VOLATILITY_5D() -> None:
    """
    Calculate 5-day price volatility factor (non-annualized).
    
    Factor: VOLATILITY_5D
    Description: 5-day price volatility calculated as the standard deviation of 
                 daily returns over the past 5 trading days (non-annualized).
    Formulation: VOLATILITY_5D_t = sqrt(1/4 * Σ_{i=0}^{4} (r_{t-i} - r̄)^2)
    Variables:
        r_{t-i}: Daily return on day t-i, calculated as (Close_{t-i}/Close_{t-i-1} - 1)
        r̄: Average daily return over the past 5 trading days
    """
    
    # Load the data
    df = pd.read_hdf("daily_pv.h5", key="data")
    
    # Ensure the index is sorted
    df = df.sort_index(level=['datetime', 'instrument'])
    
    # Extract close prices
    close_prices = df['$close'].unstack('instrument')
    
    # Calculate daily returns: r_t = (Close_t / Close_{t-1}) - 1
    daily_returns = close_prices.pct_change()
    
    # Define lookback window (5 trading days)
    window = 5
    
    # Calculate rolling standard deviation of returns (non-annualized)
    # Using ddof=1 for sample standard deviation (divide by n-1 = 4)
    rolling_std = daily_returns.rolling(window=window, min_periods=window).std(ddof=1)
    
    # Reshape to match the required output format
    result_df = rolling_std.stack().to_frame(name='VOLATILITY_5D')
    
    # Ensure the index names are correct
    result_df.index.names = ['datetime', 'instrument']
    
    # Save to HDF5 file
    result_df.to_hdf('result.h5', key='data', mode='w')
    
    # Print summary information
    print(f"DataFrame shape: {result_df.shape}")
    print(f"Non-null count: {result_df['VOLATILITY_5D'].count()}")
    print(f"Dtype: {result_df['VOLATILITY_5D'].dtype}")
    print(f"First few rows:\n{result_df.head()}")

if __name__ == "__main__":
    calculate_VOLATILITY_5D()

2025-12-01 05:13:51.934 | INFO | rdagent.oai.backend.litellm:create_chat_completion_inner_function:162 - Using chat model deepseek/deepseek-chat
2025-12-01 05:13:51.936 | INFO | rdagent.oai.backend.litellm:create_chat_completion_inner_function:166 - assistant:
{
"code": "import pandas as pd\nimport numpy as np\nfrom typing import Optional\n\ndef calculate_OLS_RETURN_PRED_5D_INTERACTION() -> None:\n """\n Calculate OLS predicted next-day return with interaction terms using 5-day lookback.\n \n Factor: OLS_RETURN_PRED_5D_INTERACTION\n Description: Predicted next-day return from OLS regression with interaction terms\n using 5-day lookback window. The model includes interaction terms\n between return and volume, and between return and price range.\n Formulation: \hat{R}
{t+1} = \beta_0 + \sum
{i=0}^{4} \beta_{1,i} R_{t-i} + \sum_{i=0}^{4} \beta_{2,i} PR_{t-i} +\n \sum_{i=0}^{4} \beta_{3,i} V_{t-i} + \sum_{i=0}^{4} \beta_{4,i} (R_{t-i} \times V_{t-i}) +\n \sum_{i=0}^{4} \beta_{5,i} (R_{t-i} \times PR_{t-i})\n Variables:\n R_{t-i}: Daily return on day t-i, calculated as (Close_{t-i} - Close_{t-i-1})/Close_{t-i-1}\n PR_{t-i}: Price range on day t-i, calculated as (High_{t-i} - Low_{t-i})/Close_{t-i}\n V_{t-i}: Normalized volume on day t-i, calculated as Volume_{t-i}/Close_{t-i}\n R_{t-i} \times V_{t-i}: Interaction term between return and volume\n R_{t-i} \times PR_{t-i}: Interaction term between return and price range\n \beta_0, \beta_{j,i}: OLS coefficients estimated using data from days t-5 to t-1\n to predict returns from days t-4 to t\n """\n \n # Load the data\n df = pd.read_hdf("daily_pv.h5", key="data")\n \n # Ensure the index is sorted\n df = df.sort_index(level=['datetime', 'instrument'])\n \n # Unstack to get instruments as columns for each feature\n close = df['$close'].unstack('instrument')\n high = df['$high'].unstack('instrument')\n low = df['$low'].unstack('instrument')\n volume = df['$volume'].unstack('instrument')\n \n # Calculate daily returns: R_t = (Close_t - Close_{t-1}) / Close_{t-1}\n returns = close.pct_change()\n \n # Calculate price range: PR_t = (High_t - Low_t) / Close_t\n price_range = (high - low) / close\n \n # Calculate normalized volume: V_t = Volume_t / Close_t\n norm_volume = volume / close\n \n # Calculate interaction terms\n # R_t * V_t\n ret_vol_interaction = returns * norm_volume\n # R_t * PR_t\n ret_pr_interaction = returns * price_range\n \n # Define lookback window for features (5 days)\n lookback = 5\n \n # Prepare result DataFrame\n result_list = []\n \n # Iterate over each instrument (column)\n for instrument in close.columns:\n # Get series for this instrument\n ret_series = returns[instrument]\n pr_series = price_range[instrument]\n vol_series = norm_volume[instrument]\n rv_series = 
ret_vol_interaction[instrument]\n rp_series = ret_pr_interaction[instrument]\n \n # Create a DataFrame to align all features\n features_df = pd.DataFrame({\n 'R': ret_series,\n 'PR': pr_series,\n 'V': vol_series,\n 'R_V': rv_series,\n 'R_PR': rp_series\n })\n \n # Drop rows with any NaN (needed for regression)\n features_df = features_df.dropna()\n \n # We need at least lookback+1 points to have a training window\n if len(features_df) < lookback + 1:\n continue\n \n # Initialize predicted returns series\n predicted_returns = pd.Series(index=features_df.index, dtype=float)\n \n # Rolling OLS regression\n for i in range(lookback, len(features_df)):\n # Training window: days [i-lookback, i-1] (5 days)\n train_start = i - lookback\n train_end = i - 1\n \n # Prepare training data\n X_train = []\n y_train = []\n \n # For each day in training window, create feature vector\n for j in range(train_start, train_end + 1):\n # Feature vector for day j: [R_j, PR_j, V_j, R_V_j, R_PR_j]\n features = [\n features_df.iloc[j]['R'],\n features_df.iloc[j]['PR'],\n features_df.iloc[j]['V'],\n features_df.iloc[j]['R_V'],\n features_df.iloc[j]['R_PR']\n ]\n X_train.append(features)\n \n # Target: next day's return (R_{j+1})\n if j + 1 < len(features_df):\n y_train.append(features_df.iloc[j + 1]['R'])\n else:\n # If no next day, skip this training sample\n X_train.pop()\n \n # Need at least 2 samples for OLS\n if len(X_train) < 2:\n continue\n \n # Convert to numpy arrays\n X_train = np.array(X_train)\n y_train = np.array(y_train)\n \n # Add intercept term\n X_train_with_intercept = np.column_stack([np.ones(len(X_train)), X_train])\n \n # Perform OLS: β = (X'X)^{-1}X'y\n try:\n # Use pseudo-inverse for stability\n beta = np.linalg.pinv(X_train_with_intercept.T @ X_train_with_intercept) @ X_train_with_intercept.T @ y_train\n \n # Prepare features for prediction (day i)\n X_pred = np.array([\n features_df.iloc[i]['R'],\n features_df.iloc[i]['PR'],\n features_df.iloc[i]['V'],\n 
features_df.iloc[i]['R_V'],\n features_df.iloc[i]['R_PR']\n ])\n X_pred_with_intercept = np.concatenate([[1], X_pred])\n \n # Predict next day return: ŷ = Xβ\n y_pred = X_pred_with_intercept @ beta\n \n # Store prediction\n predicted_returns.iloc[i] = y_pred\n \n except np.linalg.LinAlgError:\n # If matrix is singular, skip this prediction\n continue\n \n # Create DataFrame for this instrument\n instrument_df = predicted_returns.to_frame(name='OLS_RETURN_PRED_5D_INTERACTION')\n instrument_df['instrument'] = instrument\n instrument_df = instrument_df.reset_index().set_index(['datetime', 'instrument'])\n \n result_list.append(instrument_df)\n \n # Combine all instruments\n if result_list:\n result_df = pd.concat(result_list)\n else:\n # Create empty DataFrame with correct structure\n result_df = pd.DataFrame(columns=['OLS_RETURN_PRED_5D_INTERACTION'], \n index=pd.MultiIndex.from_tuples([], names=['datetime', 'instrument']))\n \n # Ensure the index names are correct\n result_df.index.names = ['datetime', 'instrument']\n \n # Sort the index\n result_df = result_df.sort_index()\n \n # Save to HDF5 file\n result_df.to_hdf('result.h5', key='data', mode='w')\n \n # Print summary information\n print(f"DataFrame shape: {result_df.shape}")\n print(f"Non-null count: {result_df['OLS_RETURN_PRED_5D_INTERACTION'].count()}")\n print(f"Dtype: {result_df['OLS_RETURN_PRED_5D_INTERACTION'].dtype}")\n print(f"First few rows:\n{result_df.head()}")\n\nif name == "main":\n calculate_OLS_RETURN_PRED_5D_INTERACTION()"
}
2025-12-01 05:14:59.138 | INFO | rdagent.oai.backend.litellm:_create_chat_completion_inner_function:203 - Current Cost: $0.0029247500; Accumulated Cost: $0.3123617900; finish_reason='stop'
2025-12-01 05:14:59.140 | INFO | rdagent.oai.backend.base:_create_chat_completion_auto_continue:644 - Using OpenAI response format: {'type': 'json_object'}
2025-12-01 05:15:00.211 | INFO | rdagent.oai.backend.litellm:_calculate_token_from_messages:67 - Token count: 367
─────────────────────────────────────────────────────────── LocalEnv Logs Begin ────────────────────────────────────────────────────────────
Run Info
┌────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Entry │ /bin/sh -c 'timeout --kill-after=10 600 python runtime_info.py; entry_exit_code=$?; exit $entry_exit_code' │
│ Local Path │ /home/osboxes/Desktop/DR/RD-Agent/git_ignore_folder/RD-Agent_workspace/adb95ef474804307909264120a2a655e │
│ Env │ PYTHONPATH:./ │
│ │ PATH:/home/osboxes/miniconda3/envs/rdagent4qlib/bin:/home/osboxes/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/… │
│ Volumes │ /tmp/full: │
│ │ /home/osboxes/Desktop/DR/RD-Agent/git_ignore_folder/RD-Agent_workspace/adb95ef474804307909264120a2a655e/workspace_cache │
└────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
=== Python Runtime Info ===
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on Linux 6.8.0-87-generic

No GPU detected (nvidia-smi not installed).
──────────────────────────────────────────────────────────── LocalEnv Logs End ─────────────────────────────────────────────────────────────
2025-12-01 05:15:01.700 | INFO | rdagent.utils.env:__run_with_retry:240 - Running time: 0.040113210678100586 seconds
2025-12-01 05:15:01.704 | INFO | rdagent.oai.backend.litellm:_calculate_token_from_messages:67 - Token count: 367
2025-12-01 05:15:02.746 | INFO | rdagent.oai.backend.litellm:_calculate_token_from_messages:67 - Token count: 367
─────────────────────────────────────────────────────────── LocalEnv Logs Begin ────────────────────────────────────────────────────────────
Run Info
┌────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Entry │ /bin/sh -c 'timeout --kill-after=10 600 python runtime_info.py; entry_exit_code=$?; exit $entry_exit_code' │
│ Local Path │ /home/osboxes/Desktop/DR/RD-Agent/git_ignore_folder/RD-Agent_workspace/db0731edf0e8411db7998acd0b93f196 │
│ Env │ PYTHONPATH:./ │
│ │ PATH:/home/osboxes/miniconda3/envs/rdagent4qlib/bin:/home/osboxes/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/… │
│ Volumes │ /tmp/full: │
│ │ /home/osboxes/Desktop/DR/RD-Agent/git_ignore_folder/RD-Agent_workspace/db0731edf0e8411db7998acd0b93f196/workspace_cache │
└────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
=== Python Runtime Info ===
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on Linux 6.8.0-87-generic

No GPU detected (nvidia-smi not installed).
──────────────────────────────────────────────────────────── LocalEnv Logs End ─────────────────────────────────────────────────────────────
2025-12-01 05:15:04.229 | INFO | rdagent.utils.env:__run_with_retry:240 - Running time: 0.03960847854614258 seconds
2025-12-01 05:15:04.234 | INFO | rdagent.oai.backend.litellm:_calculate_token_from_messages:67 - Token count: 367
2025-12-01 05:15:04.271 | INFO | rdagent.oai.backend.litellm:_calculate_token_from_messages:67 - Token count: 2847
2025-12-01 05:15:04.272 | WARNING | rdagent.oai.backend.litellm:_create_chat_completion_inner_function:139 - Model deepseek/deepseek-chat does not support response schema, ignoring response_format argument.
2025-12-01 05:15:04.272 | INFO | rdagent.oai.backend.litellm:_create_chat_completion_inner_function:149 -
Role:system
Content: User is trying to implement some factors in the following scenario:

------Background of the scenario------
This time, I need your help with the research and development of the factor. The background of the factor scenario is as follows:
The factor is a characteristic or variable used in quant investment that can help explain the returns and risks of a portfolio or a single asset. Factors are used by investors to identify and exploit sources of excess returns, and they are central to many quantitative investment strategies.
Each number in the factor represents a physics value to an instrument on a day.
User will train a model to predict the next several days return based on the factor values of the previous days.
The factor is defined in the following parts:

  1. Name: The name of the factor.
  2. Description: The description of the factor.
  3. Formulation: The formulation of the factor.
  4. Variables: The variables or functions used in the formulation of the factor.
    The factor might not provide all the parts of the information above since some might not be applicable.
    Please specifically give all the hyperparameter in the factors like the window size, look back period, and so on. One factor should statically defines one output with a static source data. For example, last 10 days momentum and last 20 days momentum should be two different factors.

====== Runtime Environment ======
You have following environment to run the code:
=== Python Runtime Info ===
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on Linux 6.8.0-87-generic
No GPU detected (nvidia-smi not installed).
------The source dataset you can use------

daily_pv.h5

File Type

HDF5 Data File

Content Overview

Data Structure

  • Index: MultiIndex with levels ['datetime', 'instrument']

Columns

All Columns:

$high Related Columns:

$high: float32

$low Related Columns:

$low: float32

$open Related Columns:

$open: float32

$close Related Columns:

$close: float32

$factor Related Columns:

$factor: float32

$volume Related Columns:

$volume: float32

----------------- file splitter -------------

README.md

File Type

Markdown Documentation

Content Overview

How to read files.

For example, if you want to read filename.h5

import pandas as pd
df = pd.read_hdf("filename.h5", key="data")

NOTE: **key is always "data" for all hdf5 files **.

Here is a short description about the data

Filename Description
"daily_pv.h5" Adjusted daily price and volume data.

For different data, We have some basic knowledge for them

Daily price and volume data

$open: open price of the stock on that day.
$close: close price of the stock on that day.
$high: high price of the stock on that day.
$low: low price of the stock on that day.
$volume: volume of the stock on that day.
$factor: factor value of the stock on that day.

------The interface you should follow to write the runnable code------
The factor code should be written in the following interface:
Your python code should follow the interface to better interact with the user's system.
Your python code should contain the following part: the import part, the function part, and the main part. You should write a main function name: "calculate_{function_name}" and call this function in "if __name__ == __main__" part. Don't write any try-except block in your python code. The user will catch the exception message and provide the feedback to you.
User will write your python code into a python file and execute the file directly with "python {your_file_name}.py". You should calculate the factor values and save the result into a HDF5(H5) file named "result.h5" in the same directory as your python file. The result file is a HDF5(H5) file containing a pandas dataframe. The index of the dataframe is the "datetime" and "instrument", and the single column name is the factor name,and the value is the factor value. The result file should be saved in the same directory as your python file.

------The output of your code should be in the format------
The factor code should output the following format:
Your output should be a pandas dataframe similar to the following example information:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 40914 entries, (Timestamp('2020-01-02 00:00:00'), 'SH600000') to (Timestamp('2021-12-31 00:00:00'), 'SZ300059')
Data columns (total 1 columns):

Column Non-Null Count Dtype


0 your factor name 40914 non-null float64
dtypes: float64(1)
memory usage:
Notice: The non-null count is OK to be different to the total number of entries since some instruments may not have the factor value on some days.
One possible format of result.h5 may be like following:
datetime instrument
2020-01-02 SZ000001 -0.001796
SZ000166 0.005780
SZ000686 0.004228
SZ000712 0.001298
SZ000728 0.005330
...
2021-12-31 SZ000750 0.000000
SZ000776 0.002459

------The simulator user can use to test your solution------
The factor code will be sent to the simulator:
The factors will be sent into Qlib to train a model to predict the next several days return based on the factor values of the previous days.
Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms. including supervised learning, market dynamics modeling, and RL.
User will use Qlib to automatically do the following things:

  1. generate a new factor table based on the factor values.
  2. train a model like LightGBM, CatBoost, LSTM or simple PyTorch model to predict the next several days return based on the factor values.
  3. build a portfolio based on the predicted return based on a strategy.
  4. evaluate the portfolio's performance including the return, sharpe ratio, max drawdown, and so on.

Your code is expected to align the scenario in any form which means The user needs to get the exact factor values with your code as expected.

To help you write the correct code, the user might provide multiple information that helps you write the correct code:

  1. The user might provide you the correct code to similar factors. Your should learn from these code to write the correct code.
  2. The user might provide you the failed former code and the corresponding feedback to the code. The feedback contains to the execution, the code and the factor value. You should analyze the feedback and try to correct the latest code.
  3. The user might provide you the suggestion to the latest fail code and some similar fail to correct pairs. Each pair contains the fail code with similar error and the corresponding corrected version code. You should learn from these suggestion to write the correct code.

Your must write your code based on your former latest attempt below which consists of your former code and code feedback, you should read the former attempt carefully and must not modify the right part of your former code.

Notice that you should not add any other text before or after the json format.

Please response the code in the following json format. Here is an example structure for the JSON output:
{
"code": "The Python code as a string."
}

Role:user
Content: --------------Target factor information:---------------
factor_name: OLS_RETURN_PRED_5D_LAGGED
factor_description: [Machine Learning based Factor] Predicted next-day return from OLS regression using lagged features with 5-day lookback window. The model uses features from exactly 1, 2, 3, 4, and 5 days ago (not a rolling window of all days) to predict the next day's return, reducing multicollinearity.
factor_formulation: \hat{R}{t+1} = \beta_0 + \beta_1 R{t-1} + \beta_2 R_{t-2} + \beta_3 R_{t-3} + \beta_4 R_{t-4} + \beta_5 R_{t-5} + \beta_6 PR_{t-1} + \beta_7 PR_{t-2} + \beta_8 PR_{t-3} + \beta_9 PR_{t-4} + \beta_{10} PR_{t-5} + \beta_{11} V_{t-1} + \beta_{12} V_{t-2} + \beta_{13} V_{t-3} + \beta_{14} V_{t-4} + \beta_{15} V_{t-5}
variables: {'R_{t-i}': 'Daily return on day t-i, calculated as (Close_{t-i} - Close_{t-i-1})/Close_{t-i-1}', 'PR_{t-i}': 'Price range on day t-i, calculated as (High_{t-i} - Low_{t-i})/Close_{t-i}', 'V_{t-i}': 'Normalized volume on day t-i, calculated as Volume_{t-i}/Close_{t-i}', '\beta_0, \beta_j': 'OLS regression coefficients estimated using data from days t-10 to t-1 to predict returns from days t-9 to t (using lagged features from 1-5 days before each prediction)'}

Here are some success implements of similar component tasks, take them as references:
--------------Correct code to similar factors:---------------

=====Factor 1:=====
factor_name: VOLATILITY_5D
factor_description: [Volatility Factor] 5-day price volatility calculated as the standard deviation of daily returns over the past 5 trading days (non-annualized).
factor_formulation: VOLATILITY_{5D,t} = \sqrt{\frac{1}{4} \sum_{i=0}^{4} (r_{t-i} - \bar{r})^2}
variables: {'r_{t-i}': 'Daily return on day t-i, calculated as (Close_{t-i}/Close_{t-i-1} - 1)', '\bar{r}': 'Average daily return over the past 5 trading days'}
=====Code:=====

File Path: factor.py

import pandas as pd
import numpy as np
from typing import Optional

def calculate_VOLATILITY_5D() -> None:
    """
    Calculate 5-day price volatility factor (non-annualized).
    
    Factor: VOLATILITY_5D
    Description: 5-day price volatility calculated as the standard deviation of 
                 daily returns over the past 5 trading days (non-annualized).
    Formulation: VOLATILITY_5D_t = sqrt(1/4 * Σ_{i=0}^{4} (r_{t-i} - r̄)^2)
    Variables:
        r_{t-i}: Daily return on day t-i, calculated as (Close_{t-i}/Close_{t-i-1} - 1)
        r̄: Average daily return over the past 5 trading days
    """
    
    # Load the data
    df = pd.read_hdf("daily_pv.h5", key="data")
    
    # Ensure the index is sorted
    df = df.sort_index(level=['datetime', 'instrument'])
    
    # Extract close prices
    close_prices = df['$close'].unstack('instrument')
    
    # Calculate daily returns: r_t = (Close_t / Close_{t-1}) - 1
    daily_returns = close_prices.pct_change()
    
    # Define lookback window (5 trading days)
    window = 5
    
    # Calculate rolling standard deviation of returns (non-annualized)
    # Using ddof=1 for sample standard deviation (divide by n-1 = 4)
    rolling_std = daily_returns.rolling(window=window, min_periods=window).std(ddof=1)
    
    # Reshape to match the required output format
    result_df = rolling_std.stack().to_frame(name='VOLATILITY_5D')
    
    # Ensure the index names are correct
    result_df.index.names = ['datetime', 'instrument']
    
    # Save to HDF5 file
    result_df.to_hdf('result.h5', key='data', mode='w')
    
    # Print summary information
    print(f"DataFrame shape: {result_df.shape}")
    print(f"Non-null count: {result_df['VOLATILITY_5D'].count()}")
    print(f"Dtype: {result_df['VOLATILITY_5D'].dtype}")
    print(f"First few rows:\n{result_df.head()}")

if __name__ == "__main__":
    calculate_VOLATILITY_5D()

2025-12-01 05:15:04.756 | INFO | rdagent.oai.backend.litellm:create_chat_completion_inner_function:162 - Using chat model deepseek/deepseek-chat
2025-12-01 05:15:04.757 | INFO | rdagent.oai.backend.litellm:create_chat_completion_inner_function:166 - assistant:
{
"code": "import pandas as pd\nimport numpy as np\nfrom typing import Optional\n\ndef calculate_OLS_RETURN_PRED_5D_LAGGED() -> None:\n """\n Calculate predicted next-day return from OLS regression using lagged features with 5-day lookback.\n \n Factor: OLS_RETURN_PRED_5D_LAGGED\n Description: Predicted next-day return from OLS regression using lagged features with 5-day lookback window.\n Formulation: \hat{R}
{t+1} = \beta_0 + \beta_1 R
{t-1} + \beta_2 R_{t-2} + \beta_3 R_{t-3} + \beta_4 R_{t-4} + \beta_5 R_{t-5} + \beta_6 PR_{t-1} + \beta_7 PR_{t-2} + \beta_8 PR_{t-3} + \beta_9 PR_{t-4} + \beta_{10} PR_{t-5} + \beta_{11} V_{t-1} + \beta_{12} V_{t-2} + \beta_{13} V_{t-3} + \beta_{14} V_{t-4} + \beta_{15} V_{t-5}\n Variables:\n R_{t-i}: Daily return on day t-i, calculated as (Close_{t-i} - Close_{t-i-1})/Close_{t-i-1}\n PR_{t-i}: Price range on day t-i, calculated as (High_{t-i} - Low_{t-i})/Close_{t-i}\n V_{t-i}: Normalized volume on day t-i, calculated as Volume_{t-i}/Close_{t-i}\n \beta_0, \beta_j: OLS regression coefficients estimated using data from days t-10 to t-1 to predict returns from days t-9 to t\n """\n \n # Load the data\n df = pd.read_hdf("daily_pv.h5", key="data")\n \n # Ensure the index is sorted\n df = df.sort_index(level=['datetime', 'instrument'])\n \n # Unstack to get instruments as columns\n close = df['$close'].unstack('instrument')\n high = df['$high'].unstack('instrument')\n low = df['$low'].unstack('instrument')\n volume = df['$volume'].unstack('instrument')\n \n # Calculate daily returns: R_t = (Close_t - Close_{t-1}) / Close_{t-1}\n returns = close.pct_change()\n \n # Calculate price range: PR_t = (High_t - Low_t) / Close_t\n price_range = (high - low) / close\n \n # Calculate normalized volume: V_t = Volume_t / Close_t\n norm_volume = volume / close\n \n # Initialize result DataFrame\n result = pd.DataFrame(index=close.index, columns=close.columns)\n result = result.astype(float)\n \n # For each instrument, compute the factor\n for instrument in close.columns:\n # Get series for this instrument\n ret_series = returns[instrument]\n pr_series = price_range[instrument]\n vol_series = norm_volume[instrument]\n \n # Create lagged features for lags 1 to 5\n features = []\n for lag in range(1, 6):\n features.append(ret_series.shift(lag))\n features.append(pr_series.shift(lag))\n features.append(vol_series.shift(lag))\n \n # Combine into a DataFrame: each row corresponds to day 
t\n # Features are from t-1 to t-5\n X = pd.concat(features, axis=1)\n X.columns = [f'R_lag{lag}' for lag in range(1,6)] + \\n [f'PR_lag{lag}' for lag in range(1,6)] + \\n [f'V_lag{lag}' for lag in range(1,6)]\n \n # Target: next day return (R_{t+1})\n y = ret_series.shift(-1) # This is R_{t+1}\n \n # Align X and y\n data = pd.concat([X, y], axis=1)\n data.columns = list(X.columns) + ['target']\n \n # For each day t, we need to estimate OLS using data from t-10 to t-1\n # to predict return for t+1 (which uses features from t-1 to t-5)\n # We'll use a rolling window of 10 days for estimation\n window_size = 10\n \n predictions = []\n \n for i in range(window_size, len(data)):\n # Estimation window: days i-10 to i-1 (inclusive)\n est_data = data.iloc[i-window_size:i]\n \n # Drop rows with any NaN in features or target within estimation window\n est_data_clean = est_data.dropna(subset=X.columns.tolist() + ['target'])\n \n # Need at least 6 observations to estimate 16 coefficients (including intercept)\n if len(est_data_clean) < 6:\n predictions.append(np.nan)\n continue\n \n # Prepare X_est and y_est\n X_est = est_data_clean[X.columns]\n y_est = est_data_clean['target']\n \n # Add intercept\n X_est_with_const = pd.DataFrame(np.ones((len(X_est), 1)), index=X_est.index, columns=['const'])\n X_est_with_const = pd.concat([X_est_with_const, X_est], axis=1)\n \n # Perform OLS: (X'X)^{-1}X'y\n try:\n XTX = X_est_with_const.T @ X_est_with_const\n # Check if matrix is invertible\n if np.linalg.matrix_rank(XTX) < XTX.shape[1]:\n predictions.append(np.nan)\n continue\n \n XTX_inv = np.linalg.inv(XTX)\n beta = XTX_inv @ X_est_with_const.T @ y_est\n \n # Features for prediction: day i (we want to predict return for i+1)\n # Features are from day i-1 to i-5 (already in X at row i)\n X_pred_row = data.iloc[i][X.columns]\n \n if X_pred_row.isna().any():\n predictions.append(np.nan)\n continue\n \n # Add intercept for prediction\n X_pred = pd.Series([1.0], 
index=['const']).append(X_pred_row)\n \n # Make prediction: \hat{R}_{i+1}\n pred = X_pred @ beta\n predictions.append(pred)\n \n except np.linalg.LinAlgError:\n predictions.append(np.nan)\n continue\n \n # Pad beginning with NaN (need at least window_size days)\n predictions = [np.nan] * window_size + predictions\n \n # Ensure we have predictions for all days\n if len(predictions) < len(result):\n predictions = predictions + [np.nan] * (len(result) - len(predictions))\n elif len(predictions) > len(result):\n predictions = predictions[:len(result)]\n \n result[instrument] = predictions\n \n # Stack to multi-index format\n result_df = result.stack().to_frame(name='OLS_RETURN_PRED_5D_LAGGED')\n result_df.index.names = ['datetime', 'instrument']\n \n # Save to HDF5 file\n result_df.to_hdf('result.h5', key='data', mode='w')\n \n # Print summary information\n print(f"DataFrame shape: {result_df.shape}")\n print(f"Non-null count: {result_df['OLS_RETURN_PRED_5D_LAGGED'].count()}")\n print(f"Dtype: {result_df['OLS_RETURN_PRED_5D_LAGGED'].dtype}")\n print(f"First few rows:\n{result_df.head()}")\n\nif name == "main":\n calculate_OLS_RETURN_PRED_5D_LAGGED()"
}
2025-12-01 05:16:10.636 | INFO | rdagent.oai.backend.litellm:_create_chat_completion_inner_function:203 - Current Cost: $0.0028652900; Accumulated Cost: $0.3152270800; finish_reason='stop'
2025-12-01 05:16:10.638 | INFO | rdagent.oai.backend.base:_create_chat_completion_auto_continue:644 - Using OpenAI response format: {'type': 'json_object'}
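A side observation on the last generated factor code, which may or may not be related to the crash: it fits 16 coefficients (intercept plus 15 lagged features) on a window of at most 10 rows, so `X'X` can never reach full rank and the `matrix_rank` guard would reject every window, leaving all predictions NaN. It also calls `Series.append`, which was removed in pandas 2.0. The sketch below illustrates both points with synthetic data; the variable names are illustrative only, and nothing in the log confirms that this is what leaves the evaluator without a feedback object.

```python
import numpy as np
import pandas as pd

# Same shape as the generated code: a 10-day estimation window and
# 16 coefficients (intercept + 5 lags each of R, PR, V).
n_rows, n_coeffs = 10, 16

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(n_rows), rng.normal(size=(n_rows, n_coeffs - 1))])

# rank(X'X) = rank(X) <= min(n_rows, n_coeffs), so it stays below 16 here.
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
print(f"rank of X'X: {rank}, coefficients to estimate: {n_coeffs}")
print(f"full rank: {rank == n_coeffs}")

# Prepending the intercept without Series.append (removed in pandas 2.0).
x_row = pd.Series([0.01, 0.02], index=["R_lag1", "PR_lag1"])
x_with_const = pd.concat([pd.Series([1.0], index=["const"]), x_row])
print(x_with_const)
```

If the factor is meant to be estimable, the window would need at least as many usable rows as coefficients, or the feature set would have to shrink.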
