In this paper, we propose an entity-centric region-of-interest detection and visual-semantic pooling scheme for complex event detection in YouTube-like videos. Our method is based on the hypothesis that many such videos involve people interacting with each other and with objects in their vicinity. Guided by this hypothesis, we first discover an Area of Interest (AoI) map in image keyframes and then use it for localized feature pooling. The AoI map is derived from image-based saliency cues weighted by the actionable space of the person involved in the event; we estimate this actionable space from the person's position and the gaze-based attention allocated to each region. Using the AoI map, we partition the image into disjoint regions, pool features separately from each region, and combine the pooled features into a single image signature. We show that the resulting semantically pooled image signature carries discriminative information and detects visual events favorably compared with state-of-the-art approaches.
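The pipeline above can be sketched in code. This is only an illustrative toy, not the paper's implementation: the Gaussian "actionable space" centered on the person, the fixed threshold, the function names, and the use of mean pooling are all assumptions standing in for the position- and gaze-based weighting and pooling the paper describes.

```python
import numpy as np

def aoi_map(saliency, person_xy, sigma=20.0):
    """Illustrative AoI map: image saliency weighted by a Gaussian
    'actionable space' centered on the person's position (a stand-in
    for the paper's position- and gaze-based weighting)."""
    h, w = saliency.shape
    ys, xs = np.mgrid[0:h, 0:w]
    px, py = person_xy
    actionable = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
    return saliency * actionable

def semantic_pool(features, aoi, threshold=0.5):
    """Partition the frame into high- and low-AoI regions, average-pool
    the per-pixel features in each region, and concatenate the two
    pooled vectors into one image signature."""
    aoi = aoi / (aoi.max() + 1e-8)          # normalize to [0, 1]
    mask = aoi >= threshold                  # high-AoI region
    d = features.shape[-1]
    fg = features[mask].mean(axis=0) if mask.any() else np.zeros(d)
    bg = features[~mask].mean(axis=0) if (~mask).any() else np.zeros(d)
    return np.concatenate([fg, bg])

# Toy example: random saliency and per-pixel features on a 64x64 frame.
rng = np.random.default_rng(0)
sal = rng.random((64, 64))
feats = rng.random((64, 64, 128))
signature = semantic_pool(feats, aoi_map(sal, person_xy=(32, 32)))
print(signature.shape)  # (256,)
```

In practice the two regions would use learned features and the signature would feed an event classifier; the sketch only shows how the AoI map splits pooling into disparate spatial regions.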