One of the more common and costly mistakes I see in the many data implementations I get involved with all the time is confusion about keys. If you get it right, you end up with an efficient, high performance database. Get it wrong, and you'll end up with data quality issues everywhere, and massive maintenance costs.
It all starts with understanding the business need, which is easiest to do through business rules. In the conceptual portion of design, we find out what enties are. For the sake of this discussion, we'll be working with employees. No relation can exist in the database unless its defined. So we start with the definition:
"An Employee is any person receiving compensation for services rendered with the exception of contractors"
Is this a great definition? Probably not perfect. It's likely though.
The next part of the business rule is important though:
"An employee can be identified by their natural attribute of a social security number of employment identification number"
If, in negotiation with business stakeholders, these definitions are agreed on, SSN_EID is a candidate for a primary key. There can be no two Employees with the same SSN_EID. But that's just the start, there are other possible options as well:
"An employee can be identified by their natural attributes: First Name, Last Name, Place of Birth, and Date of Birth"
It's very unlikely two employees will have the same set of attributes for this grouping either, so it's a candidate key.
None of the attributes of a relation's key can be nullable. If a thing is identified by it's natural attributes, and one of those attributes is unknowable, that thing cannot be knowable. Logically, it cannot exist.
Either is likely useable as a primary key, but we run into physical limitations. First, most platforms perform pruning tasks much more efficiently on numeric data types. Second, compound keys are even less performant. Last, there's the pragmatic problem of spreading PII all over your database. So if we've got a compound text key, it's going to be slow. This is a physical issue, not a logical issue, it doesn't negate our previous decisions. Common practice, when dealing with these relations that have child relations, is to surrogate the key. Add a field, then populate it with globally unique arbitrary values. Generally this is done by using an auto increment integer or big integer, but sometimes using a GUID, or a result of a hash algorythm over the natural key.
Surrogating the primary key, though, doesn't get you off the hook for constraining the natural key. Even if every row in my employee table has a unique surrogate EmployeeID, I need to ensure no two employees have the same SSN/EIN. It seems simple enough, but I've seen this happen at fortune 100 companies. I have no idea the side effects. Maybe they get two paychecks? But I digress.
All good for operational systems, but how does this affect the analytics and BI space?
Many of the same rules apply to the BI/Analytics space. If I don't know the natural key of a relation I can't possibly ensure the data I'm ingesting is accurate. But things get even more complex here. For instance, a customer can be deleted from the source system, then recreated with a new surrogate key. The analytics platform should know that's the same person based on the natural attributes. For this reason, natural keys must be enforced in the BI/Analytics space, and surrogate keys should be used for secondary purposes.
Follow these basic ideas, and you'll avoid a LOT of work. Violate them, and you'll be going to finance looking for more funding.