Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags