Accurate dense depth prediction for 3-D reconstruction of monocular endoscopic images plays an essential role in expanding the surgical field of view in robotic surgery. However, precisely estimating dense depth is generally challenging because of complex surgical fields with a limited field of view, illumination variations, and variable texture structures. This work explores the performance of convolutional networks and transformer-based networks for endoscopic depth prediction, and proposes a new architecture called densely convolved transformer aggregation networks (DCTAN) that aggregates local texture features and global spatial-temporal features for endoscopic dense depth recovery. Specifically, DCTAN introduces a new hybrid encoder that combines dense convolution and scalable transformers to extract, in parallel, local texture features and global spatial-temporal features from monocular endoscopic video sequences. A local and global aggregation decoder then assembles the tokens of each frame into global feature maps, which are integrated with the corresponding local feature maps to predict depth in a coarse-to-fine manner. We trained and evaluated DCTAN through self-supervised learning on monocular synthetic data (with ground truth) and colonoscopic video images. The experimental results demonstrate that our architecture extracts more accurate local and global features for depth prediction and achieves a more accurate depth range, a more complete depth structure, and richer texture information than other networks. In particular, our method outperforms current monocular dense depth estimation models in both qualitative and quantitative evaluations.
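To make the described architecture concrete, the following is a minimal PyTorch-style sketch of a hybrid encoder with a dense convolutional branch (local texture features) and a transformer branch (global features), plus a decoder that fuses both to predict a dense depth map from coarse to fine. This is not the authors' DCTAN implementation: the class names (HybridEncoder, DepthDecoder), patch size, channel widths, and layer counts are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the authors' DCTAN): parallel local and
# global feature extraction followed by coarse-to-fine fusion for depth.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridEncoder(nn.Module):
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        # Local branch: strided convolutions capturing texture at 1/2 and 1/4 scale.
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, dim, 3, stride=2, padding=1),
                                   nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1),
                                   nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        # Global branch: 16x16 patch embedding + a small transformer encoder.
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        f1 = self.conv1(x)                                 # local features, 1/2 scale
        f2 = self.conv2(f1)                                # local features, 1/4 scale
        tokens = self.patch_embed(x)                       # B x dim x H/16 x W/16
        b, c, h, w = tokens.shape
        tokens = self.transformer(tokens.flatten(2).transpose(1, 2))
        g = tokens.transpose(1, 2).reshape(b, c, h, w)     # global tokens as a map
        return f1, f2, g


class DepthDecoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.fuse2 = nn.Conv2d(2 * dim, dim, 3, padding=1)
        self.fuse1 = nn.Conv2d(2 * dim, dim, 3, padding=1)
        self.head = nn.Conv2d(dim, 1, 3, padding=1)

    def forward(self, f1, f2, g):
        # Coarse-to-fine: upsample global features and merge with local maps.
        g2 = F.interpolate(g, size=f2.shape[-2:], mode='bilinear', align_corners=False)
        d2 = F.relu(self.fuse2(torch.cat([f2, g2], dim=1)))
        d1 = F.interpolate(d2, size=f1.shape[-2:], mode='bilinear', align_corners=False)
        d1 = F.relu(self.fuse1(torch.cat([f1, d1], dim=1)))
        return torch.sigmoid(self.head(d1))                # normalized inverse depth


# Usage on a single endoscopic frame (batch of 1, 3 x 256 x 320).
encoder, decoder = HybridEncoder(), DepthDecoder()
depth = decoder(*encoder(torch.randn(1, 3, 256, 320)))
print(depth.shape)  # torch.Size([1, 1, 128, 160])
```

In a self-supervised setup such as the one described in the abstract, the predicted depth would typically be supervised indirectly through a photometric reprojection loss between adjacent video frames rather than ground-truth depth; that training loop is omitted here.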