We noticed the same thing: giving more instructions, that would better detail the task, would sometimes give better results but would fail more often. The simpler the better. But we also need to think a bit like a LLM does to even improve your results: returning the country code is harder than returning the full country name for example, as abbreviating is another layer of complexity and there is maybe less examples overall of the abbreviate names versus the full name in the data set. Choose what must have been the most prevalent item in the training data.