Urban scene modeling is a challenging task for the photogrammetry and computer vision community because urban scenes are large in scale, structurally complex, and topologically delicate. This paper presents an efficient multistep modeling framework for large-scale urban scenes from aerial images. It takes aerial images and a textured 3D mesh model generated by an image-based modeling system as input, and outputs compact polygon models with semantics at different levels of detail (LODs). Based on the key observation that urban buildings usually have piecewise planar rooftops and vertical walls, we propose a segment-based modeling method consisting of three major stages: scene segmentation, roof contour extraction, and building modeling. First, the scene is segmented into three classes by combining deep neural network predictions with geometric constraints of the 3D mesh. Then, for each building mesh, 2D line segments are detected and used to slice the ground plane into polygon cells, and each cell is assigned a roof plane via a Markov random field (MRF) optimization. Finally, the LOD model is obtained by extruding the cells to their corresponding planes. Rather than modeling directly in 3D space, we transform the mesh into a uniform 2D image grid representation and perform most of the modeling work in 2D, which lowers computational complexity and improves robustness. In addition, our method does not require any global prior, such as the Manhattan or Atlanta world assumption, making it flexible enough to model scenes with different characteristics and complexity. Experiments on both single buildings and large-scale urban scenes demonstrate that, by combining 2D photometric and 3D geometric information, the proposed algorithm is robust and efficient in vectorized LOD modeling of urban scenes compared with state-of-the-art approaches.
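The last two steps described above (assigning each polygon cell a roof plane, then extruding the cell to that plane) can be illustrated with a minimal Python sketch. It is not the method of this paper: the abstract does not specify the MRF energy terms or the solver, so the sketch assumes a generic per-cell data cost with a Potts smoothness term and uses iterated conditional modes (ICM) as a simple stand-in optimizer. The names `assign_planes_icm`, `plane_height`, and `extrude_cell`, the `(a, b, c, d)` plane encoding, and the `ground_z` parameter are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]
# Assumed plane encoding: (a, b, c, d) with a*x + b*y + c*z + d = 0.
Plane = Tuple[float, float, float, float]


def assign_planes_icm(
    data_cost: Sequence[Sequence[float]],
    neighbors: Sequence[Sequence[int]],
    smooth_weight: float = 1.0,
    max_iters: int = 20,
) -> List[int]:
    """Label each cell with a plane index by greedy ICM on a Potts-model MRF.

    data_cost[i][k] is the (assumed) cost of giving cell i plane k;
    neighbors[i] lists the cells adjacent to cell i.
    """
    # Start from the cheapest plane per cell, ignoring smoothness.
    labels = [min(range(len(c)), key=c.__getitem__) for c in data_cost]
    for _ in range(max_iters):
        changed = False
        for i, costs in enumerate(data_cost):
            def energy(k: int) -> float:
                # Data term plus a penalty for each neighbor with a different label.
                disagreements = sum(labels[j] != k for j in neighbors[i])
                return costs[k] + smooth_weight * disagreements

            best = min(range(len(costs)), key=energy)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break  # reached a local minimum of the energy
    return labels


def plane_height(plane: Plane, x: float, y: float) -> float:
    """Solve a*x + b*y + c*z + d = 0 for z; the plane must not be vertical."""
    a, b, c, d = plane
    if abs(c) < 1e-12:
        raise ValueError("a vertical plane cannot serve as a roof plane")
    return -(a * x + b * y + d) / c


def extrude_cell(
    cell: Sequence[Point2D], plane: Plane, ground_z: float = 0.0
) -> Tuple[List[Point3D], List[List[Point3D]]]:
    """Lift a 2D polygon cell to its roof plane and build one vertical wall per edge."""
    roof = [(x, y, plane_height(plane, x, y)) for x, y in cell]
    walls = []
    n = len(cell)
    for i in range(n):
        j = (i + 1) % n
        (x0, y0), (x1, y1) = cell[i], cell[j]
        # Wall quad: two ground vertices, then the two roof vertices above them.
        walls.append([(x0, y0, ground_z), (x1, y1, ground_z), roof[j], roof[i]])
    return roof, walls
```

For example, with three cells in a row where the middle cell only weakly prefers a different plane, `assign_planes_icm([[0.0, 2.0], [0.6, 0.5], [0.0, 2.0]], [[1], [0, 2], [1]])` pulls the middle cell to the plane of its neighbors. ICM only finds a local minimum; a graph-cut or message-passing solver would be the usual choice when a stronger optimum is needed.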