- Problems with BLEU (reference: "Evaluating Text Output in NLP: BLEU at your own risk"):
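- It doesn’t consider meaning: it only counts exact n-gram overlap with the reference, so a synonym is penalized like any other wrong word.
- It doesn’t directly consider sentence structure: n-gram matching only captures local word order.
- It doesn’t handle morphologically rich languages well: an inflected variant of a correct word counts as a mismatch.
- It doesn’t map well to human judgements.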
- Alternatives: NIST, ROUGE, perplexity, WER, and F-score. Specifically for sequence-to-sequence tasks: STM, METEOR, TER, TERp, hLEPOR, RIBES, and MEWR.
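
The "doesn't consider meaning" problem can be illustrated with a minimal sketch. This is not full BLEU: it computes only the clipped (modified) n-gram precision that BLEU is built on, and leaves out the brevity penalty and the geometric mean over n-gram orders. The helper names `ngrams` and `clipped_precision` and the example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    Tokenization is a plain whitespace split, which is a simplification.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return overlap / total


# Hypothetical example sentences, chosen only to show the effect.
reference = "the food was good"
negated = "the food was not good"       # opposite meaning, high overlap
paraphrase = "the meal tasted great"    # same meaning, low overlap

for name, sentence in [("negated", negated), ("paraphrase", paraphrase)]:
    p1 = clipped_precision(sentence, reference, 1)
    p2 = clipped_precision(sentence, reference, 2)
    print(f"{name}: unigram precision = {p1:.2f}, bigram precision = {p2:.2f}")
```

Under these assumptions the negated sentence gets a unigram precision of 0.80 and the paraphrase gets 0.25, even though only the paraphrase preserves the meaning of the reference. For real evaluation, an established BLEU implementation is a better choice than this sketch.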